[DotGNU]Hanging problem
From: Russell Stuart
Subject: [DotGNU]Hanging problem
Date: 07 May 2004 11:52:40 +1000
This hanging problem is still with me. I still have no idea what causes
it. I have spent the last few hours in gdb. I am telling you what I
have found in the hope you can spot something I haven't.
1. The program has 6 threads - at least that is what I can see in
gdb. I can only guess at what they are from looking at the gdb
backtraces - it is a bit hard to tell as I haven't figured out
how to get a PNet C# backtrace from gdb.
a. The pthread manager thread. I don't know what it does,
but I presume it does not figure in this problem.
b. The PNet GC thread. Ditto.
c. The System.Threading.Timer thread. I re-wrote this class
when I found it had a lot of bugs. The patch is currently
in the savannah patch manager as I haven't got around to
writing tests for it, so you can look at it if you want to.
It is sitting in a Monitor.Wait(Object, int), as it should
be. I.e., it holds no locks. The reason I am fairly sure
this is the timer thread is that Timer.cs is the only place that
does a Monitor.Wait(), AFAICT.
d. A thread sitting in a WaitHandle.WaitOne(). There is only one
possibility, as there is only one place that does this sort of
call - a background thread of mine that sends packets. Its
code looks roughly like this:
    for (;;) {
        autoResetEvent.WaitOne();
        for (;;) {
            lock (this) packet = getPacketOffQueue();
            if (packet == null) break;
            socket.send(packet);
        }
    }
So it holds no locks.
e. A thread blocked on a socket read. This is in my code. It
is a background thread that roughly does this:
    for (;;) {
        lock (this) check for exit;
        socket.receive_from(packet, ...);
        lock (this) processPacket = this.processPacketDelegate;
        if (processPacket != null) processPacket(packet);
    }
So it also holds no locks.
f. Finally, we come to the thread that is hung. It is the main
thread, actually. It is sitting in a Monitor.Enter(),
blocked. Given that none of the other threads are holding a
lock this is wrong, obviously.
2. The question that does spring to mind is how can I be sure no
other thread holds a lock on a monitor. Well, nowhere in my code
do I use anything other than "lock (..) ...". Nowhere do I call
Thread.Interrupt() or Thread.Abort(). In other words, there is
nowhere that a Monitor.Enter() can happen without a matching
Monitor.Exit().
3. It now reliably fails on every machine I run it on. Single CPU.
Multi CPU. Hyper-threaded. Various kernels. RH 7.2 and 8.0.
4. In trying to figure out why the Monitor.Enter has blocked, I tried
a few things. Firstly, I altered ilrun to throw an exception when
it blocked, thus giving me a C# back trace. I now know that the
blocked thread holds no other locks.
Secondly, with gdb I looked at the internal ilrun structures. This
is what I found:
- My monitor's enterCount was 2. It can only be 2 if there is
another unmatched call to Monitor.Enter(). There aren't any,
as I have shown.
- The monitor->waitHandle->parent.owner is not 0, which would
have to be the case since ILWaitMonitorTryEnter is blocking.
The owning thread is thread (e) above. This makes some small
degree of sense as thread (e) would grab the monitor in
question from time to time as packets are processed.
So what I have now is two independent sources (my enterCount and
your "owner" field) telling me the monitor is currently locked.
Surely this must mean that Monitor.Exit() was not called, or if
it was called it didn't work. One argument against the "didn't
work" theory is that I have two different implementations of
Monitor.Exit() written by two programmers - you and me. And it
fails with both of them.
However, I put a call to the Unix "abort()" function on every
possible route through _IL_Monitor_Exit that did not unlock the
monitor. It was never hit. Ergo I can only conclude that every
call to Monitor.Exit() successfully decremented enterCount and
unlocked the underlying mutex.
So then I decided that perhaps an exception was being thrown
while this object was locked, and somehow the Monitor.Exit()
wasn't being executed. So, I added a "locked object count"
to each thread (the ILExecThread structure, actually). When
an object successfully called _IL_Monitor_InternalTryEnter it
was incremented, and when it successfully called
_IL_Monitor_Exit it was decremented. So it was only 0 when
no locks were held. Then I altered engine/throw.c to contain
this code:
    void ILExecThreadSetException(ILExecThread *thread, ILObject *obj)
    {
        if (thread->lockCount != 0) // @@@
            abort(); // @@@
        thread->thrownException = obj;
    }
The abort() call was never hit. Ergo, an exception was never thrown
while a monitor lock was held, so an exception could not be the
cause of the problem.
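For anyone who wants to see the shape of that instrumentation, here is
a rough, self-contained sketch of the lock-count idea. It is a model
only, not the PNet source: DemoThread, DemoMonitor and the demo_*
functions are hypothetical names made up for illustration, and the
functions return a flag where the real experiment called abort().

```c
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real ILExecThread or PNet monitor. */
typedef struct DemoThread {
    int lockCount;          /* monitor entries currently held by this thread */
    void *thrownException;  /* pending exception object, or NULL */
} DemoThread;

typedef struct DemoMonitor {
    DemoThread *owner;      /* NULL when the monitor is unlocked */
    int enterCount;         /* recursive entries made by the owner */
} DemoMonitor;

/* Returns 1 if the monitor was acquired, 0 if another thread owns it. */
int demo_monitor_try_enter(DemoThread *t, DemoMonitor *m)
{
    if (m->owner != NULL && m->owner != t)
        return 0;
    m->owner = t;
    m->enterCount++;
    t->lockCount++;         /* count successful entries only */
    return 1;
}

/* Returns 1 on success, 0 if the caller does not own the monitor. */
int demo_monitor_exit(DemoThread *t, DemoMonitor *m)
{
    if (m->owner != t || m->enterCount == 0)
        return 0;           /* the kind of path that got an abort() */
    if (--m->enterCount == 0)
        m->owner = NULL;    /* last exit releases the monitor */
    t->lockCount--;
    return 1;
}

/* Records the exception; returns 1 if it was raised while a lock was held. */
int demo_set_exception(DemoThread *t, void *obj)
{
    int heldLock = (t->lockCount != 0);
    t->thrownException = obj;
    return heldLock;
}
```

In this model an enterCount of 2 can only come from the owning thread
entering twice without an exit in between, which is the situation the
real counters appear to report.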
I am now at a total loss. I have no idea how what I am seeing could be
possible, and can see no way forward. Any ideas?