Re: Hang in srfi-18.test

guile-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Hang in srfi-18.test

From:	Ludovic Courtès
Subject:	Re: Hang in srfi-18.test
Date:	Fri, 24 Oct 2008 16:25:57 +0200
User-agent:	Gnus/5.11 (Gnus v5.11) Emacs/22.3 (gnu/linux)

Hi Neil,

"Neil Jerram" <address@hidden> writes:

> In order for wait-condition-variable to be atomic - e.g. in a race
> where thread A holds (Scheme-level) mutex M, and calls
> (wait-condition-variable C M), and thread B calls (begin (lock-mutex
> M) (signal-condition-variable C)) - it needs to call pthread_cond_wait
> with the same underlying mutex as is involved in the `lock-mutex'
> call.  In terms of the threads.c code, this means that it has to use
> M->lock, not C->lock.
>
> block_self() used its mutex arg for two purposes: for protecting
> access and changes to the wait queue, and for the pthread_cond_wait
> call.  But it wouldn't work reliably to use M->lock to protect C's
> wait queue, because in theory two threads can call
> (wait-condition-variable C M1) and (wait-condition-variable C M2)
> concurrently, with M1 and M2 different.  So we either have to pass
> both C->lock and M->lock into block_self(), or use some other mutex to
> protect the wait queue.  For this patch, I switched to using the
> critical section mutex, because that is a global and so easily
> available.  (If that turns out to be a problem for performance, we
> could make each queue structure have its own mutex, but there's no
> reason to believe yet that it is a problem, because the critical
> section mutex isn't used much overall.)
>
> So then we call block_self() with M->lock, and move where M->lock is
> unlocked to after the block_self() call, instead of before.
>
> That solves the first hang, but introduces a new one, when a SRFI-18
> thread is terminated (`thread-terminate!') between being launched
> (`make-thread') and started (`thread-start!').  The problem now is
> that pthread_cond_wait is a cancellation point (see man
> pthread_cancel), so the pthread_cond_wait call is one of the few
> places where a thread-terminate! call can take effect.  If the thread
> is cancelled at that point, M->lock ends up still being locked, and
> then when do_thread_exit() tries to lock M->lock again, it hangs.
>
> The fix for that is a new `held_mutex' field in scm_i_thread, which is
> set to point to the mutex just before a pthread_cond_(timed)wait call,
> and set to NULL again afterwards.  If on_thread_exit() finds that
> held_mutex is non-NULL, it unlocks that mutex.
>
> A detail is that checking and unlocking held_mutex must be done before
> on_thread_exit() calls scm_i_ensure_signal_delivery_thread(), because
> the innards of scm_i_ensure_signal_delivery_thread() can do another
> pthread_cond_wait() call and so overwrite held_mutex.  But that's OK,
> because it's fine for the mutex check and unlock to happen outside
> Guile mode.

Ouch.  I'm afraid I'm not familiar enough with that code to say anything
meaningful, but I'd be so happy to see that `srfi-18.test' deadlock
vanish that the only thing I can say is: go ahead!  :-)

Thanks for your debugging and report!

Do you think it would be feasible to add tests for the corner cases you
mentioned (e.g., two threads calling `wait-condition-variable' on the
same condition but with a different mutex)?

> Lastly, C->lock is then not needed, so I've removed it.

Nice.

Thank you!

Ludo'.

[Prev in Thread]

Current Thread

[Next in Thread]

Hang in srfi-18.test, Neil Jerram, 2008/10/18
- Re: Hang in srfi-18.test, Neil Jerram, 2008/10/22
  - Re: Hang in srfi-18.test, Neil Jerram, 2008/10/22
    - Re: Hang in srfi-18.test, Ludovic Courtès <=

Prev by Date: goops compilation under vm: not quite.
Next by Date: Re: full moon, vm status update
Previous by thread: Re: Hang in srfi-18.test
Next by thread: notes on the goops dispatch implementation
Index(es):
- Date
- Thread