[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: parallel autotest revisited
Re: parallel autotest revisited
Tue, 7 Jul 2009 21:04:26 +0200
* Eric Blake wrote on Tue, Jul 07, 2009 at 03:22:13PM CEST:
> I'm trying to get parallel tests working on cygwin, and the use of <> on a
> fifo is a sticking point. With this patch, things still pass on Linux and
> Solaris, so I believe it is more portable. The idea is that rather than
> relying on the non-POSIX read-write fifo, we instead open a write fifo in
> each test group and a read fifo in the master process. POSIX requires
> opening a unidirectional fifo to block until someone else opens the
> counterpart direction, so we must delay the read fd until after we have
> spawned at least one background task that is blocked on opening a write fd.
Thank you very much for working on this!
> It doesn't reliably fix things for cygwin 1.5 [... but there is hope
> for 1.7 ... ]
Thanks for investigating. Do we try to cater to cygwin 1.5 users
somehow, or just document that they shouldn't try parallel autotest?
> Does this patch look correct? Should I go ahead and apply it?
AFAICS this has a race between the read open and the write open/exec of
the fifo: the master can spawn 2 test group processes and read open/exec
the fifo before any of the tests have write opened it.
This is no problem however: the scheduler should eventually start one of
the tests, which then write opens the fifo, which allows the master to
However, if, say, all test groups are killed before any can write open
the fifo, then the master may hang. This is not very likely, of course:
typically, the master will have received the same signal (shutdown, or
terminal hangup, for example), and that will take it out of the hang.
However, this is a newly introduced, albeit very minor, limitation.
If OTOH the master never read opens/execs the fifo, then the test group
processes will hang in their write open of the fifo. This can happen
with your patch if the test for $at_stop_file is true very early. Also,
the reading back of the remaining tokens will then try to read from a
closed fd, leading to:
| ./testsuite: line 2232: 6: Bad file descriptor
errors (you can reproduce this by removing `test -f "$at_stop_file" &&').
A trivial fix for both issues is to move the test for the stop file
after the block that read opens the fifo. This move doesn't really
matter for the other use that $at_first has (which, BTW, is anyway a bit
not well defined in the parallel case, but that's really irrelevant).
This bug would deserve a test, but unfortunately, since it requires a
race to be won, and also cannot be triggered from inside a test group,
so I don't really know how to write one.
Since the code consuming one token also assumes that AT_JOB_FIFO_FD is a
valid fd, I'd move its open up above that block, too, right after the
$at_job_control_off. That way, you don't rely on the number of jobs
being greater than one. So, in total, please squash in the diff below.
I haven't tested this patch on other systems yet, but hope to get to
that soon (that AIX box I still have feedback pending from was offline).
This shouldn't keep you from applying it though, with above fix; the
fact that we limit ourselves to bash and zsh is a big help here.
I'm seeing a parallel test failure with zsh, BTW, but that happens with
or without your patch. Details in another mail.
diff --git a/lib/autotest/general.m4 b/lib/autotest/general.m4
index 63b0525..e01cfe7 100644
@@ -1370,16 +1370,16 @@ dnl kill -13 $$
echo token >&AT_JOB_FIFO_FD
+ if $at_first; then
+ exec AT_JOB_FIFO_FD<"$at_job_fifo"
shift # Consume one token.
if test address@hidden:@] -gt 0; then :; else
read at_token <&AT_JOB_FIFO_FD || break
set x $[*]
test -f "$at_stop_file" && break
- if $at_first; then
- exec AT_JOB_FIFO_FD<"$at_job_fifo"
# Read back the remaining ($at_jobs - 1) tokens.
set X $at_joblist