Re: [Monotone-devel] parallel-tests merged
From: Zack Weinberg
Subject: Re: [Monotone-devel] parallel-tests merged
Date: Fri, 17 Aug 2007 17:56:36 -0700
On 8/17/07, William Uther <address@hidden> wrote:
> Hi,
> This is failing miserably on my mac (MacOS X 10.4, Intel mac). I
> first tried -j4, then I scaled back to just make check. The results
> below are for the straight "make check".
[...]
> 1 _unit_tester_fail_check FAIL (gobbledygook:
> Check failed (return value): wanted 0 got -126
[...]
OH NO NOT THIS AGAIN.
This is a heisenbug of the very worst kind, which I saw during my own
testing but thought I had eliminated the provocation for.
Here's the deal. The per-test child process is expected to write a
detailed human-readable log of its operations to a file "tester.log"
in the per-test directory. It is also supposed to write a one-line
machine-parseable summary of the overall state of the test (passed,
failed, skipped, etc) to a file "STATUS" in the per-test directory.
The parent process interprets the STATUS file to accumulate statistics
about the run and print the success or failure messages. (It is
necessary to do this dance because _exit() only passes back a few bits
of information to the parent: at most the low eight bits of the status
on POSIX.) At the Lua level, there are two
different file handles - "test.log" and "s" (see run_one_test in
testlib.lua). Lua file handles are a thin wrapper around C stdio
FILEs.
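To make the arrangement above concrete, here is a minimal sketch of the child/parent split, written in Python rather than the Lua and C++ the real harness uses. The function names and the status words "ok" and "failed" are assumptions for illustration only; the actual format lives in testlib.lua.

```python
import os
import tempfile

def run_child(test_dir, passed):
    """Child side: write a detailed log and a one-line STATUS summary."""
    with open(os.path.join(test_dir, "tester.log"), "w") as log:
        log.write("running test...\n")
        log.write("check returned %s\n" % ("ok" if passed else "error"))
    # Hypothetical one-line, machine-parseable summary.
    with open(os.path.join(test_dir, "STATUS"), "w") as status:
        status.write("ok\n" if passed else "failed\n")

def run_one(passed):
    """Parent side: fork a per-test child, wait, then read STATUS."""
    test_dir = tempfile.mkdtemp()
    pid = os.fork()
    if pid == 0:
        # In the child: do the work, then leave without returning
        # into the parent's code path.
        try:
            run_child(test_dir, passed)
        finally:
            os._exit(0)
    os.waitpid(pid, 0)
    with open(os.path.join(test_dir, "STATUS")) as f:
        return f.read().strip()
```

The point of the split is that the exit status only has to say "the child finished"; everything richer travels through the two files.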
When this bug hits, a block of text that is supposed to go to
tester.log winds up in STATUS instead, and the text that is supposed
to go to STATUS disappears into a black hole. The code that reads
STATUS is, defensively, interpreting that as a failure.
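A defensive reader of that kind might look roughly like the following Python sketch. The set of state names and the function name are hypothetical; the only behavior taken from the description above is that anything other than a single recognizable summary line counts as a failure.

```python
# Hypothetical state names; the real ones are defined by the test harness.
KNOWN_STATES = {"ok", "failed", "skipped"}

def parse_status(text):
    """Return the recorded state, or "failed" if STATUS is not well-formed."""
    lines = text.splitlines()
    if len(lines) == 1 and lines[0].strip() in KNOWN_STATES:
        return lines[0].strip()
    # Empty file, stray log text, or anything else unexpected.
    return "failed"
```

With a reader like this, a STATUS file that contains a chunk of tester.log output is reported as a failed test instead of crashing the parent.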
A previous version of the code - never checked into the repository -
showed this bug about one time in four - only with ./run_unit_tests,
not always the same test cases, and never under strace. I *thought*
that it was a problem with swapping out file descriptors 0, 1, and 2
behind stdio's back, so I took all the code that did that out and made
the log file be a separate file instead of the child's stdout/err.
That made the problem go away for me. However, I never proved
conclusively what was causing it, and if you're seeing it with the
checked-in code, there must be something else wrong.
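For readers unfamiliar with the hazard suspected above, here is a small Python sketch of the general mechanism: a buffered file object holds text in user space, its descriptor is re-pointed underneath it, and the later flush lands in the wrong file. Python's buffered files stand in for C stdio here, and the function name is made up. This only illustrates the class of problem; it is not a diagnosis of this bug.

```python
import os
import tempfile

def misdirected_write():
    d = tempfile.mkdtemp()
    log_path = os.path.join(d, "tester.log")
    status_path = os.path.join(d, "STATUS")

    log = open(log_path, "w")
    # This text sits in the user-space buffer; nothing is on disk yet.
    log.write("detailed log text\n")

    # Re-point the log's descriptor at STATUS without telling the
    # buffered layer that anything changed.
    status_fd = os.open(status_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    os.dup2(status_fd, log.fileno())
    os.close(status_fd)

    # Closing flushes the buffer to wherever the descriptor points now.
    log.close()

    with open(log_path) as f:
        log_contents = f.read()
    with open(status_path) as f:
        status_contents = f.read()
    return log_contents, status_contents
```

After the call, tester.log is empty and STATUS holds the log text, which resembles the symptom described earlier, although the checked-in code no longer swaps descriptors this way.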
I am at my wits' end with this bug. There are a couple other things
that could be tried - for instance, passing the text for STATUS back
from run_one_test to the C++ layer and writing it out there - but
without knowing where the problem comes from, we're just flailing
around in the dark. (Did I mention the problem disappears under
strace? Or if I stick in debugging printfs?)
zw