[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: parallel autotest revisited
Re: parallel autotest revisited
Tue, 7 Jul 2009 20:33:28 +0000 (UTC)
Ralf Wildenhues <Ralf.Wildenhues <at> gmx.de> writes:
> Thank you very much for working on this!
And thanks for your review. If nothing else, I've learned a lot more about
named pipes through this ordeal.
> > It doesn't reliably fix things for cygwin 1.5 [... but there is hope
> > for 1.7 ... ]
> Thanks for investigating. Do we try to cater to cygwin 1.5 users
> somehow, or just document that they shouldn't try parallel autotest?
Cygwin 1.5 development is dead, but it may be a couple years before it is no
longer an actively used configuration (it is the last version of cygwin that
runs on Windows 98, for example). Part of me is okay with the idea of just
documenting that cygwin 1.5 users should not expect parallel autotest to work.
Or we could add this code, which so far appears to reliably sniff out the
cygwin 1.5 bug without leaving a hung process on either cygwin or on proper
systems (with cygwin, the echo occurs because the exec didn't block):
test "x`(exec 5>fifo; echo oops)& sleep 1; (kill -9 $! ) 2>/dev/null`" = x
But it comes at the penalty of several forks and a non-parallel sleep all
before starting the first test of any parallel test run (but at least we can
factor it so that this test only occurs only if all other prerequisites of -j
I have not yet come up with any less-invasive test that can reliably flush out
cygwin 1.5 fifo mishandling.
The verdict is still out on cgf being able to fix cygwin 1.7 to have named
pipes obey POSIX, but I'm hoping it works out (his latest comment is quite
| Just an update: I now have a headache. Thanks again Microsoft for
| making it all so complicated. This is YA example of a feature (Windows
| Named Pipes) which behave 98% of the way you'd expect it to work. The
| 2% is the killer.
> > Does this patch look correct? Should I go ahead and apply it?
> AFAICS this has a race between the read open and the write open/exec of
> the fifo: the master can spawn 2 test group processes and read open/exec
> the fifo before any of the tests have write opened it.
> This is no problem however: the scheduler should eventually start one of
> the tests, which then write opens the fifo, which allows the master to
The POSIX rules for blocking when opening a named file are symmetric. So the
way I see it, either:
the test group(s) block waiting for the master
the master blocks waiting for at least one test group
Either way, there is no loss in this scheduler race. Both outcomes give the
correct behavior (eventually, enough processes are scheduled so that both read
and write sides are opened, and the block is removed). However, it does mean
that if the scheduler favors the test groups over the master, that it is
possible for ALL of the test groups unblock at once (whereas if the master were
to blindly read open the fifo every iteration of the loop, then the parallel
test groups would be a bit more staggered, because the master would unblock
only the first test group).
> However, if, say, all test groups are killed before any can write open
> the fifo, then the master may hang. This is not very likely, of course:
> typically, the master will have received the same signal (shutdown, or
> terminal hangup, for example), and that will take it out of the hang.
> However, this is a newly introduced, albeit very minor, limitation.
Yes, I agree that there is a narrow window for a race here. But thinking about
it more, the only thing that I could come up with that might make a difference
is to open yet another subshell background task prior to the
$at_job_control_on, with the sole job of write opening the pipe before any of
the test groups are created, and which is reaped later via $! rather than jobs,
so that the master's process group will also own a write fifo fd. On the other
hand, like you say, if a signal has been sent to each process group to kill all
the job-controlled children, it is likely that the same signal will also be
sent to the master. So I don't think it's worth worrying about.
> If OTOH the master never read opens/execs the fifo, then the test group
> processes will hang in their write open of the fifo. This can happen
> with your patch if the test for $at_stop_file is true very early. Also,
> the reading back of the remaining tokens will then try to read from a
> closed fd, leading to:
> | ./testsuite: line 2232: 6: Bad file descriptor
> errors (you can reproduce this by removing `test -f "$at_stop_file" &&').
> A trivial fix for both issues is to move the test for the stop file
> after the block that read opens the fifo. This move doesn't really
> matter for the other use that $at_first has (which, BTW, is anyway a bit
> not well defined in the parallel case, but that's really irrelevant).
I thought about opening the read end sooner when writing my patch. In fact, as
mentioned for item 1, I also thought about always blindly invoking 'exec
AT_JOB_FIFO_FD<"$at_job_fifo"' immediately after spawning each test group
rather than making it conditional on $at_first (which is reached only after n
test groups had already been spawned). The only reason I didn't was because
there was already a check of $at_first that was easy to exploit, and it seemed
a waste of system calls to make the shell repeatedly reopen the same fifo onto
the same fd. But your arguments about moving the open fifo to sooner are good,
so I will gladly squash in your code motion.
> I haven't tested this patch on other systems yet, but hope to get to
> that soon (that AIX box I still have feedback pending from was offline).
> This shouldn't keep you from applying it though, with above fix; the
> fact that we limit ourselves to bash and zsh is a big help here.
You can run the parallel tests with other shells, you just have to explicitly
ask for them: make TESTSUITEFLAGS='TEST_PARALLEL_AUTOTEST=:'.