bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

bug-guix

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27

From:	Mark H Weaver
Subject:	bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Date:	Thu, 05 Jul 2018 06:05:59 -0400
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Hi,

Andy Wingo <address@hidden> writes:

> On Thu 05 Jul 2018 05:33, Mark H Weaver <address@hidden> writes:
>
>>> One problem I’ve noticed is that the child process that
>>> ‘call-with-decompressed-port’ spawns would be stuck trying to get the
>>> allocation lock:
>>>
>>> So it seems quite clear that the thing has the alloc lock taken.  I
>>> suppose this can happen if one of the libgc threads runs right when we
>>> call fork and takes the alloc lock, right?
>>
>> Does libgc spawn threads that run concurrently with user threads?  If
>> so, that would be news to me.  My understanding was that incremental
>> marking occurs within GC allocation calls, and marking threads are only
>> spawned after all user threads have been stopped, but I could be wrong.
>
> I think Mark is correct.

Actually, looking at the libgc code more closely, it seems that Ludovic
is correct.  GC_init calls GC_start_mark_threads_inner if PARALLEL_MARK
is defined at compile time.

However, it's also the case that libgc uses 'pthread_atfork' (where
available) to arrange to grab the GC allocation as well as the "mark
locks" in the case where parallel marking is enabled.  See
fork_prepare_proc, fork_parent_proc, and fork_child_proc in
pthread_support.c.

As the libgc developers admit in the comment above 'fork_prepare_proc',
they are not strictly meeting the requirements for safe use of 'fork',
but they _are_ grabbing the allocation lock during 'fork', at least on
systems that support 'pthread_atfork'.

It looks like setting the GC_MARKERS environment variable to 1 should
result in 'available_markers_m1' being set to 0, and thus effectively
disable parallel marking.  In that case, if I understand the code
correctly, no marker threads will be spawned.

It would be interesting to see if this problem can be reproduced when
GC_MARKERS is set to 1.

>> The first idea that comes to my mind is that perhaps the finalization
>> thread is holding the GC allocation lock when 'fork' is called.
>
> So of course we agree you're only supposed to "fork" when there are no
> other threads running, I think.
>
> As far as the finalizer thread goes, "primitive-fork" calls
> "scm_i_finalizer_pre_fork" which should join the finalizer thread,
> before the fork.  There could be a bug obviously but the intention is
> for Guile to shut down its internal threads.

Ah, good!  I didn't know this.  So, I guess the problem is most likely
elsewhere.

We should probably arrange to join the signal delivery thread at the
same time, and then to respawn it in the parent and child if needed.
What do you think?

> Here's the body of primitive-fork fwiw:
>
>     {
>       int pid;
>       scm_i_finalizer_pre_fork ();
>       if (scm_ilength (scm_all_threads ()) != 1)

I think there's a race here.  I think it's possible for the finalizer
thread to be respawned after 'scm_i_finalizer_pre_fork' in two different
ways:

(1) 'scm_all_threads' performs allocation, which could trigger GC.

(2) another thread could perform heap allocation and trigger GC after
    'scm_i_finalizer_pre_fork' joins the thread.  it might then shut
    down before 'scm_all_thread' is called.

However, these are highly unlikely scenarios, and most likely not the
problem we're seeing here.

Still, I think the 'scm_i_finalizer_pre_fork' should be moved after the
'if', to avoid this race.

>> Another possibility: both the finalization thread and the signal
>> delivery thread call 'scm_without_guile', which calls 'GC_do_blocking',
>> which also temporarily grabs the GC allocation lock before calling the
>> specified function.  See 'GC_do_blocking_inner' in pthread_support.c in
>> libgc.  You spawn the signal delivery thread by calling 'sigaction' and
>> you make work for it to do every second when the SIGALRM is delivered.
>
> The signal thread is a possibility though in that case you'd get a
> warning; the signal-handling thread appears in scm_all_threads.  Do you
> see a warning?  If you do, that is a problem :)

Good point!

>>> If that is correct, the fix would be to call fork within
>>> ‘GC_call_with_alloc_lock’.
>>>
>>> How does that sound?
>>
>> Sure, sounds good to me.
>
> I don't think this is necessary.  I think the problem is that other
> threads are running.  If we solve that, then we solve this issue; if we
> don't solve that, we don't know what else those threads are doing, so we
> don't know what mutexes and other state they might have.

On second thought, I agree with Andy here.  Grabbing the allocation lock
shouldn't be needed, and moreover it's not sufficient.  I think we
really need to ensure that no other threads are running, in which case
grabbing the allocation lock is pointless.  Anyway, it seems that libgc
is already arranging to do this, at least on modern GNU/Linux systems.

     Thanks,
       Mark

[Prev in Thread]

Current Thread

[Next in Thread]

bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Ludovic Courtès, 2018/07/04
- bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Ludovic Courtès, 2018/07/04
  - bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Mark H Weaver, 2018/07/04
    - bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Andy Wingo, 2018/07/05
    - bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Mark H Weaver <=
    - bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Andy Wingo, 2018/07/05
    - bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Ludovic Courtès, 2018/07/05
    - bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Andy Wingo, 2018/07/05
    - bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Ludovic Courtès, 2018/07/06
    - bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27, Ludovic Courtès, 2018/07/05

Prev by Date: bug#30312: documentation: misleading EFI partitioning instructions
Next by Date: bug#32058: mysql build fails on d88b29d6b78482cdb05ac714984f6a27195e3d37
Previous by thread: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Next by thread: bug#31925: 'guix substitutes' sometimes hangs on glibc 2.27
Index(es):
- Date
- Thread