guile-devel

Re: Hang in srfi-18.test


From: Neil Jerram
Subject: Re: Hang in srfi-18.test
Date: Wed, 22 Oct 2008 19:14:09 +0100

2008/10/22 Neil Jerram <address@hidden>:
> 2008/10/18 Neil Jerram <address@hidden>:
>> Just a heads up...
>>
>> I'm seeing a hang in srfi-18.test, when running make check in master,
>> in the "exception handler installation is thread-safe" test.  It's not
>> 100% reproducible, so it looks like there's a race involved.
>>
>> I think the problem is that wait-condition-variable is not actually
>> atomic in the way that it is supposed to be.  It unlocks the mutex,
>> then starts waiting on the cond var.  So it is possible for another
>> thread to lock the same mutex, and signal the cond var, before the
>> wait-condition-variable thread starts waiting.
>>
>> I don't currently have a solution, but I'm working on it...
>
> The attached patch seems to work.  I'll write more to explain later!

So:

In order for wait-condition-variable to be atomic - e.g. in a race
where thread A holds (Scheme-level) mutex M, and calls
(wait-condition-variable C M), and thread B calls (begin (lock-mutex
M) (signal-condition-variable C)) - it needs to call pthread_cond_wait
with the same underlying mutex as is involved in the `lock-mutex'
call.  In terms of the threads.c code, this means that it has to use
M->lock, not C->lock.
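
To make the race concrete, here's a minimal C sketch of the two
shapes (illustrative fields like m->lock and c->cond, not the actual
threads.c code):

    /* Broken shape: unlocking M and starting the wait are two
       separate steps.  Thread B can run (lock-mutex M) and
       (signal-condition-variable C) in the gap between them, so
       the signal arrives before A is waiting and is lost. */
    pthread_mutex_unlock (&m->lock);
    pthread_cond_wait (&c->cond, &c->lock);   /* C->lock held here */

    /* Atomic shape: pthread_cond_wait releases M->lock and begins
       waiting as a single operation, so B cannot acquire M until A
       is already on the wait queue. */
    pthread_cond_wait (&c->cond, &m->lock);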

block_self() used its mutex arg for two purposes: for protecting
access and changes to the wait queue, and for the pthread_cond_wait
call.  But it wouldn't work reliably to use M->lock to protect C's
wait queue, because in theory two threads can call
(wait-condition-variable C M1) and (wait-condition-variable C M2)
concurrently, with M1 and M2 different.  So we either have to pass
both C->lock and M->lock into block_self(), or use some other mutex to
protect the wait queue.  For this patch, I switched to using the
critical section mutex, because that is a global and so easily
available.  (If that turns out to be a problem for performance, we
could make each queue structure have its own mutex, but there's no
reason to believe yet that it is a problem, because the critical
section mutex isn't used much overall.)
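
As a standalone illustration of that split (a global guard for the
queue, the caller's mutex for the wait), the new shape of block_self()
is roughly this; the types and names here are stand-ins, not the patch
itself:

    #include <pthread.h>

    /* Stand-ins for the real threads.c structures. */
    static pthread_mutex_t queue_guard = PTHREAD_MUTEX_INITIALIZER;

    struct waiter     { pthread_cond_t cond; struct waiter *next; };
    struct wait_queue { struct waiter *head; };

    /* Add the caller to C's wait queue under the global guard,
       then wait on the caller's own mutex, so the queue stays
       consistent even when two waiters bring different mutexes,
       and release-and-wait is atomic with respect to M.  (Error
       handling and timed waits omitted.) */
    static void
    block_self (struct wait_queue *q, struct waiter *self,
                pthread_mutex_t *mutex)
    {
      pthread_mutex_lock (&queue_guard);
      self->next = q->head;
      q->head = self;
      pthread_mutex_unlock (&queue_guard);

      /* Atomically releases *mutex and waits; *mutex is locked
         again by the time this returns. */
      pthread_cond_wait (&self->cond, mutex);
    }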

So then we call block_self() with M->lock, and move where M->lock is
unlocked to after the block_self() call, instead of before.
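
On the wait-condition-variable side the change then looks roughly like
this (again a sketch with made-up field names, not the patch text):

    /* Before: M unlocked up front, wait done against C->lock. */
    pthread_mutex_unlock (&m->lock);
    block_self (&c->waiting, &self, &c->lock);

    /* After: block_self is handed M->lock itself, and M is only
       unlocked once the wait has completed and returned. */
    block_self (&c->waiting, &self, &m->lock);
    pthread_mutex_unlock (&m->lock);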

That solves the first hang, but introduces a new one, when a SRFI-18
thread is terminated (`thread-terminate!') between being created
(`make-thread') and started (`thread-start!').  The problem now is
that pthread_cond_wait is a cancellation point (see man
pthread_cancel), so the pthread_cond_wait call is one of the few
places where a thread-terminate! call can take effect.  If the thread
is cancelled at that point, M->lock ends up still being locked (POSIX
re-acquires the mutex before acting on the cancellation), and then
when do_thread_exit() tries to lock M->lock again, it hangs.

The fix for that is a new `held_mutex' field in scm_i_thread, which is
set to point to the mutex just before a pthread_cond_(timed)wait call,
and set to NULL again afterwards.  If on_thread_exit() finds that
held_mutex is non-NULL, it unlocks that mutex.
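
In outline, every pthread_cond_(timed)wait call goes through something
like this (the wrapper name is made up; held_mutex is the new field):

    /* Record which mutex would be left locked if the thread is
       cancelled while it is inside pthread_cond_wait. */
    static int
    recorded_cond_wait (scm_i_thread *t, pthread_cond_t *cond,
                        pthread_mutex_t *mutex)
    {
      int err;
      t->held_mutex = mutex;    /* cancellation may strike here */
      err = pthread_cond_wait (cond, mutex);
      t->held_mutex = NULL;     /* reached only if not cancelled */
      return err;
    }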

A detail is that checking and unlocking held_mutex must be done before
on_thread_exit() calls scm_i_ensure_signal_delivery_thread(), because
the innards of scm_i_ensure_signal_delivery_thread() can do another
pthread_cond_wait() call and so overwrite held_mutex.  But that's OK,
because it's fine for the mutex check and unlock to happen outside
Guile mode.
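
So the start of on_thread_exit() looks roughly like this (a sketch of
the ordering, not the actual body):

    static void
    on_thread_exit (void *v)
    {
      scm_i_thread *t = v;

      /* Must come first: the call below can itself reach
         pthread_cond_wait() and overwrite held_mutex.  Unlocking
         a pthread mutex needs no Guile mode, so doing it here is
         safe. */
      if (t->held_mutex)
        {
          pthread_mutex_unlock (t->held_mutex);
          t->held_mutex = NULL;
        }

      scm_i_ensure_signal_delivery_thread ();
      /* ... rest of the thread cleanup ... */
    }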

Lastly, C->lock is then not needed, so I've removed it.

Any thoughts or comments?  (In case not, I guess I'll commit in a
couple of days; let me know if I should wait longer.)

     Neil



