[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Hang in srfi-18.test
Re: Hang in srfi-18.test
Fri, 24 Oct 2008 16:25:57 +0200
Gnus/5.11 (Gnus v5.11) Emacs/22.3 (gnu/linux)
"Neil Jerram" <address@hidden> writes:
> In order for wait-condition-variable to be atomic - e.g. in a race
> where thread A holds (Scheme-level) mutex M, and calls
> (wait-condition-variable C M), and thread B calls (begin (lock-mutex
> M) (signal-condition-variable C)) - it needs to call pthread_cond_wait
> with the same underlying mutex as is involved in the `lock-mutex'
> call. In terms of the threads.c code, this means that it has to use
> M->lock, not C->lock.
> block_self() used its mutex arg for two purposes: for protecting
> access and changes to the wait queue, and for the pthread_cond_wait
> call. But it wouldn't work reliably to use M->lock to protect C's
> wait queue, because in theory two threads can call
> (wait-condition-variable C M1) and (wait-condition-variable C M2)
> concurrently, with M1 and M2 different. So we either have to pass
> both C->lock and M->lock into block_self(), or use some other mutex to
> protect the wait queue. For this patch, I switched to using the
> critical section mutex, because that is a global and so easily
> available. (If that turns out to be a problem for performance, we
> could make each queue structure have its own mutex, but there's no
> reason to believe yet that it is a problem, because the critical
> section mutex isn't used much overall.)
> So then we call block_self() with M->lock, and move where M->lock is
> unlocked to after the block_self() call, instead of before.
> That solves the first hang, but introduces a new one, when a SRFI-18
> thread is terminated (`thread-terminate!') between being launched
> (`make-thread') and started (`thread-start!'). The problem now is
> that pthread_cond_wait is a cancellation point (see man
> pthread_cancel), so the pthread_cond_wait call is one of the few
> places where a thread-terminate! call can take effect. If the thread
> is cancelled at that point, M->lock ends up still being locked, and
> then when do_thread_exit() tries to lock M->lock again, it hangs.
> The fix for that is a new `held_mutex' field in scm_i_thread, which is
> set to point to the mutex just before a pthread_cond_(timed)wait call,
> and set to NULL again afterwards. If on_thread_exit() finds that
> held_mutex is non-NULL, it unlocks that mutex.
> A detail is that checking and unlocking held_mutex must be done before
> on_thread_exit() calls scm_i_ensure_signal_delivery_thread(), because
> the innards of scm_i_ensure_signal_delivery_thread() can do another
> pthread_cond_wait() call and so overwrite held_mutex. But that's OK,
> because it's fine for the mutex check and unlock to happen outside
> Guile mode.
Ouch. I'm afraid I'm not familiar enough with that code to say anything
meaningful, but I'd be so happy to see that `srfi-18.test' deadlock
vanish that the only thing I can say is: go ahead! :-)
Thanks for your debugging and report!
Do you think it would be feasible to add tests for the corner cases you
mentioned (e.g., two threads calling `wait-condition-variable' on the
same condition but with a different mutex)?
> Lastly, C->lock is then not needed, so I've removed it.