dotgnu-general

[DotGNU]Hanging problem


From: Russell Stuart
Subject: [DotGNU]Hanging problem
Date: 07 May 2004 11:52:40 +1000

This hanging problem is still with me.  I still have no idea what causes
it.  I have spent the last few hours in gdb.  I am telling you what I
have found in the hope you can spot something I haven't.

1.  The program has 6 threads - at least that is what I can see in
    gdb.  I can only guess at what they are from looking at the gdb
    backtraces - it is a bit hard to tell as I haven't figured out
    how to get a PNet C# backtrace from gdb.

      a.  The pthread manager thread.  I don't know what it does,
          but I presume it does not figure in this problem.

      b.  The PNet GC thread.  Ditto.

      c.  The System.Threading.Timer thread.  I re-wrote this class
          when I found it had a lot of bugs.  The patch is currently
          in the savannah patch manager as I haven't got around to
          writing tests for it, so you can look at it if you want to.
          It is sitting in a Monitor.Wait(Object, int), as it should
          be.  I.e., it holds no locks.  The reason I am fairly sure
          this is the timer thread is that Timer.cs is the only place
          that does a Monitor.Wait(), AFAICT.

      d.  A thread sitting in a WaitHandle.WaitOne().  There is only one
          possibility, as there is only one place that does this sort of
          call - a background thread of mine that sends packets. Its
          code looks roughly like this:
            for (;;) {
              autoResetEvent.WaitOne();
              for (;;) {
                lock (this) packet = getPacketOffQueue();
                if (packet == null) break;
                socket.send(packet);
              }
            }
          So it holds no locks.

      e.  A thread blocked on a socket read.  This is in my code.  It
          is a background thread that roughly does this:
            for (;;) {
               lock (this) check for exit;
               socket.receive_from(packet, ...);
               lock (this) processPacket = this.processPacketDelegate;
               if (processPacket != null) processPacket(packet);
            }
          So it also holds no locks.

      f.  Finally, we come to the thread that is hung.  It is the main
          thread, actually.  It is sitting in a Monitor.Enter(),
          blocked.  Given that none of the other threads are holding a
          lock this is wrong, obviously.

2.  The question that does spring to mind is how can I be sure no
    other thread holds a lock on a monitor.  Well, nowhere in my code 
    do I use anything other than "lock (..) ...".  Nowhere do I call
    Thread.Interrupt() or Thread.Abort().  In other words, there is
    nowhere that a Monitor.Enter() can happen without a matching
    Monitor.Exit().

3.  It now reliably fails on every machine I run it on.  Single CPU.
    Multi CPU.  Hyper-threaded.  Various kernels.  RH 7.2 and 8.0.

4.  In trying to figure out why the Monitor.Enter has blocked, I tried
    a few things.  Firstly, I altered ilrun to throw an exception when
    it blocked, thus giving me a C# back trace.  I now know that the
    blocked thread holds no other locks.

    Secondly, with gdb I looked at the internal ilrun structures.  This
    is what I found:

      - My monitor's enterCount was 2.  It can only be 2 if there is
        another unmatched call to Monitor.Enter().  There aren't any,
        as I have shown.

      - The monitor->waitHandle->parent.owner is not 0, which would
        have to be the case since ILWaitMonitorTryEnter is blocking.
        The owning thread is thread (e) above.  This makes some small
        degree of sense as thread (e) would grab the monitor in
        question from time to time as packets are processed.
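
As a rough illustration of the bookkeeping being inspected here, the
sketch below models a recursive monitor with an owner and an enter
count.  It is not PNet's code: ModelMonitor, model_try_enter and
model_exit are hypothetical names, thread ids are plain ints, and it
assumes enterCount counts only successful enters by the owner.  Under
those assumptions, balanced enter/exit pairs always return owner to 0,
so a non-zero owner implies an enter that was never matched by an exit.

```c
#include <assert.h>

/* Hypothetical model of a monitor's bookkeeping; not the PNet structure. */
typedef struct {
    int owner;       /* id of the owning thread, 0 when the monitor is free */
    int enterCount;  /* unmatched enters made by the owner (assumption) */
} ModelMonitor;

/* Returns 1 if thread `tid` now holds the monitor, 0 if it would block. */
int model_try_enter(ModelMonitor *m, int tid)
{
    if (m->owner == 0) {          /* free: take ownership */
        m->owner = tid;
        m->enterCount = 1;
        return 1;
    }
    if (m->owner == tid) {        /* re-entry by the current owner */
        m->enterCount++;
        return 1;
    }
    return 0;                     /* held by someone else: caller blocks */
}

/* Returns 1 on success, 0 if `tid` does not own the monitor. */
int model_exit(ModelMonitor *m, int tid)
{
    if (m->owner != tid || m->enterCount == 0)
        return 0;                 /* exit without a matching enter */
    if (--m->enterCount == 0)
        m->owner = 0;             /* last exit releases the monitor */
    return 1;
}
```

In this model, a second thread's model_try_enter() keeps returning 0
for as long as the first thread has one enter outstanding, which is the
state gdb appears to show for the main thread and thread (e).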

So what I have now is two independent sources (my enterCount and
your "owner" field) telling me the monitor is currently locked.
Surely this must mean that Monitor.Exit() was not called, or if
it was called it didn't work.  One argument against the "didn't
work" theory is that I have two different implementations of
Monitor.Exit() written by two programmers - you and me.  And it
fails with both of them.

However, I put a call to the Unix "abort()" function on every
possible route through _IL_Monitor_Exit that did not unlock the
monitor.  It was never hit.  Ergo I can only conclude that every
call to Monitor.Exit() successfully decremented enterCount and
unlocked the underlying mutex.

So then I decided that perhaps an exception was being thrown
while this object was locked, and somehow the Monitor.Exit()
wasn't being executed.  So, I added a "locked object count"
to each thread (the ILExecThread structure, actually).  When
an object successfully called _IL_Monitor_InternalTryEnter it
was incremented, and when it successfully called
_IL_Monitor_Exit it was decremented.  So it was only 0 when
no locks were held.  Then I altered engine/throw.c to contain
this code:
  void ILExecThreadSetException(ILExecThread *thread, ILObject *obj)
  {
    if (thread->lockCount != 0)  // @@@
      abort(); // @@@
    thread->thrownException = obj;
  }

The abort() call was never hit.  Ergo, an exception was never thrown
while a monitor lock was held, so an exception could not be the
cause of the problem.
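
For readers who want to reproduce this kind of check, here is a
minimal sketch of the counting side of the instrumentation.  It is a
stand-alone model, not the actual patch: ModelThread and the model_*
functions are hypothetical stand-ins for ILExecThread and the hooks
placed in _IL_Monitor_InternalTryEnter and _IL_Monitor_Exit.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-thread state; not ILExecThread. */
typedef struct {
    int   lockCount;        /* monitors currently held by this thread */
    void *thrownException;  /* pending exception object, if any */
} ModelThread;

/* Call after a successful monitor enter. */
void model_note_enter(ModelThread *t)
{
    t->lockCount++;
}

/* Call after a successful monitor exit. */
void model_note_exit(ModelThread *t)
{
    if (t->lockCount <= 0)
        abort();            /* exit without a matching enter */
    t->lockCount--;
}

/* Mirrors the check added to engine/throw.c above. */
void model_set_exception(ModelThread *t, void *obj)
{
    if (t->lockCount != 0)
        abort();            /* exception raised while a lock is held */
    t->thrownException = obj;
}
```

Note that a check of this shape fires on any exception raised while a
lock is held, including ones that a finally block would have cleaned
up correctly, so it is stricter than strictly needed for this test.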

I am now at a total loss.  I have no idea how what I am seeing could be
possible, and can see no way forward.  Any ideas?



