Re: Test-lock hang (not 100% reproducible) on GNU/Linux

From: Pavel Raiskup
Subject: Re: Test-lock hang (not 100% reproducible) on GNU/Linux
Date: Tue, 03 Jan 2017 16:30:32 +0100
User-agent: KMail/5.3.3 (Linux/4.8.15-300.fc25.x86_64; KDE/5.27.0; x86_64; ; )

Hello Berny,

On Monday, January 2, 2017 8:02:03 PM CET Bernhard Voelker wrote:
> On 01/02/2017 05:37 PM, Pavel Raiskup wrote:
> > On Monday, January 2, 2017 4:50:28 PM CET Bruno Haible wrote:
> >> Especially since the problem occurs only on one architecture.
> > 
> > I've been able to reproduce this on i686 in the meantime too, sorry -- I
> > just reported what I observed :(.  See [1].
> ... or it is related to the KOJI environment?

Maybe. I was able to reproduce this on an x86_64 VM, running the build (and
tests) in a mock i386 _chroot_ (the Koji system "cross-compiles" packages
in an i386 chroot).

I was finally able to attach strace and gdb to the process (on an 8-core
machine), and gdb confirmed that:

  - the readers wait for the main process (busy loop)
  - the main process waits for all writers to finish (thread join)
  - the writers wait indefinitely for all readers to release the rwlock
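To make that three-way wait concrete, here is a hypothetical C sketch of the same structure. It is not the gnulib test and not the patch attached below; `run_trial`, `done`, `starved_writers`, the thread counts, and the 20 ms hold time are all made up for illustration. Readers spin until main sets `done`, main joins all writers first, and writers wait for the rwlock. So that the sketch cannot hang, writers use POSIX `pthread_rwlock_timedwrlock` with a 1 s deadline and count a timeout as starvation.

```c
/* Hypothetical sketch only: NOT gnulib's test-lock.c, NOT unfair-sched.patch.
   Assumes POSIX threads with pthread_rwlock_timedwrlock (POSIX.1-2008). */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

enum { NREADERS = 3, NWRITERS = 3, HOLD_MS = 20 };

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_bool done;           /* set by main once all writers are joined */
static atomic_int starved_writers; /* writers whose wrlock attempt timed out */

static void check(int rc)
{
  if (rc != 0)
    abort();
}

static void sleep_ms(long ms)
{
  struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
  nanosleep(&ts, NULL);
}

/* Reader: re-takes the read lock in a loop until main sets `done`;
   the sleep prolongs each read-side critical section. */
static void *reader(void *arg)
{
  (void) arg;
  while (!atomic_load(&done))
    {
      check(pthread_rwlock_rdlock(&lock));
      sleep_ms(HOLD_MS);
      check(pthread_rwlock_unlock(&lock));
    }
  return NULL;
}

/* Writer: waits at most ~1 s for the write lock instead of forever. */
static void *writer(void *arg)
{
  (void) arg;
  struct timespec deadline;
  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec += 1;
  int rc = pthread_rwlock_timedwrlock(&lock, &deadline);
  assert(rc == 0 || rc == ETIMEDOUT);
  if (rc == 0)
    check(pthread_rwlock_unlock(&lock));
  else
    atomic_fetch_add(&starved_writers, 1);
  return NULL;
}

/* Runs one trial and returns how many writers were starved. */
static int run_trial(void)
{
  pthread_t r[NREADERS], w[NWRITERS];
  atomic_store(&done, false);
  atomic_store(&starved_writers, 0);
  for (int i = 0; i < NREADERS; i++)
    {
      check(pthread_create(&r[i], NULL, reader, NULL));
      sleep_ms(HOLD_MS / NREADERS); /* stagger readers so their holds overlap */
    }
  for (int i = 0; i < NWRITERS; i++)
    check(pthread_create(&w[i], NULL, writer, NULL));
  /* Main joins all writers first; readers keep spinning meanwhile. */
  for (int i = 0; i < NWRITERS; i++)
    check(pthread_join(w[i], NULL));
  atomic_store(&done, true);
  for (int i = 0; i < NREADERS; i++)
    check(pthread_join(r[i], NULL));
  return atomic_load(&starved_writers);
}
```

With a reader-preferring rwlock (the glibc default kind), `run_trial()` will often report starved writers, but that outcome depends on scheduling and is not guaranteed either way.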

> I've seen some of the gnulib tests in the coreutils-testsuite failing on
> several non-x86 archs on the openSUSE build service in the past
> (especially on newer ones like aarch64).  But at least in the past year, the
> tests on all of i586, x86_64, ppc, ppc64, ppc64le, aarch64 and armv7l
> have been quite stable.

That seems to be a different issue.

I can reproduce it only very non-deterministically, under weird conditions.
To make it a bit more deterministic, please try the attached patch, which:

  - prolongs the critical section in the reader thread
  - to be a bit fairer, decreases the number of concurrent threads to
    3 readers (and 3 writers), so the thread queue is not too long

After at most a few iterations, the output contains only lines like:

  Checker 0x7fc9586b6700 after  check unlock
  Checker 0x7fc9586b6700 before check rdlock
  Checker 0x7fc957eb5700 after  check unlock
  Checker 0x7fc957eb5700 before check rdlock

That happens at least on my box; is it the same for you?  If so, is there
some mistake in the patch?  Otherwise, that would just prove that we are
testing behavior which is not guaranteed to happen; IOW, we can't
guarantee that the critical sections _don't_ always take the same (long
enough) time, so that there is always at least one reader holding the lock.
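If the test really depends on unguaranteed scheduling, one possible way to make writer progress independent of reader timing (not something proposed in this message) is to request a writer-preferring lock. For ordinary (SCHED_OTHER) threads, POSIX leaves it implementation-defined whether a reader may acquire the lock while writers are blocked. The sketch below assumes glibc, whose non-portable `pthread_rwlockattr_setkind_np` with `PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP` makes new readers wait while a writer is queued. The helper `writer_gets_lock` and the timing values are hypothetical.

```c
/* Hypothetical sketch; the writer-preference request is glibc-only. */
#define _GNU_SOURCE /* for pthread_rwlockattr_setkind_np (glibc extension) */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

static pthread_rwlock_t lock;
static atomic_bool stop;

static void check(int rc)
{
  if (rc != 0)
    abort();
}

static void sleep_ms(long ms)
{
  struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
  nanosleep(&ts, NULL);
}

/* Reader: keeps re-taking the read lock with a prolonged critical section. */
static void *reader(void *arg)
{
  (void) arg;
  while (!atomic_load(&stop))
    {
      check(pthread_rwlock_rdlock(&lock));
      sleep_ms(20);
      check(pthread_rwlock_unlock(&lock));
    }
  return NULL;
}

/* Returns true if the calling thread gets the write lock within
   timeout_s seconds while two staggered readers keep re-acquiring it. */
static bool writer_gets_lock(int timeout_s)
{
  pthread_rwlockattr_t attr;
  check(pthread_rwlockattr_init(&attr));
#ifdef __GLIBC__
  /* Ask glibc to block new readers while a writer is waiting. */
  check(pthread_rwlockattr_setkind_np(&attr,
          PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP));
#endif
  check(pthread_rwlock_init(&lock, &attr));
  check(pthread_rwlockattr_destroy(&attr));

  pthread_t r[2];
  atomic_store(&stop, false);
  for (int i = 0; i < 2; i++)
    {
      check(pthread_create(&r[i], NULL, reader, NULL));
      sleep_ms(10); /* stagger so read-side critical sections overlap */
    }

  struct timespec deadline;
  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec += timeout_s;
  int rc = pthread_rwlock_timedwrlock(&lock, &deadline);
  assert(rc == 0 || rc == ETIMEDOUT);
  if (rc == 0)
    check(pthread_rwlock_unlock(&lock));

  atomic_store(&stop, true);
  for (int i = 0; i < 2; i++)
    check(pthread_join(r[i], NULL));
  check(pthread_rwlock_destroy(&lock));
  return rc == 0;
}
```

On glibc the writer should get the lock as soon as the readers already inside their critical sections leave; on other C libraries the `#ifdef` drops the request and the result is again scheduling-dependent.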

I am worried about the explicit yield, which doesn't help because (probably?)


Attachment: unfair-sched.patch
Description: Text Data
