[patch] parallel make bug with failing commands
From: Michael Matz
Subject: [patch] parallel make bug with failing commands
Date: Sun, 31 Jul 2005 05:57:57 +0200 (CEST)
Hi,
[please keep me CCed, I'm not subscribed to bug-make]
I noticed this problem with make 3.80 while building GCC. I can
reproduce it with a small makefile, and also with current CVS of GNU make.
I'll describe the symptoms first and then the bug. The former is a bit
long, so you may want to skip to the description of the bug, which is
obvious once you know where to look.
See this Makefile:
----------------------------
.PHONY: all fail1 fail2 fail3 ok1 ok2 ok3
all: fail1 ok1 fail2 ok2 fail3 ok3
fail1 fail2 fail3:
	echo Fail
	exit 1
ok1 ok2 ok3:
	echo Ok
	sleep 2
	echo ok done
----------------------------
So we have a mixture of failing and succeeding commands, where the
succeeding commands need quite some time to finish. Running make on the
above in parallel will sometimes result in make not waiting for all
started jobs before exiting. A multi-CPU machine makes this more likely.
A higher number for -jN makes it more likely too (I can usually reproduce
it just fine with -j6, i.e. with the maximum parallelism for this
makefile, but others might have to add more targets).
This is an example of the bug:
% make -r -j5 ; echo "============================="; pp sleep
echo Fail
Fail
exit 1
echo Ok
echo Fail
echo Ok
echo Fail
Ok
sleep 2
Fail
exit 1
make: *** [fail3] Error 1
make: *** Waiting for unfinished jobs....
make: *** [fail1] Error 1
Ok
sleep 2
Fail
exit 1
make: *** [fail2] Error 1
=============================
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
matz 14483 0.0 0.1 7112 736 pts/0 S 06:02 0:00 sleep 2
matz 14485 0.0 0.1 7112 736 pts/0 S 06:02 0:00 sleep 2
Note how, even after 'make' has stopped, there are still two sleeps
running on the system.
The above example is of course harmless. But this also happens if the
commands are sub-makes, which then hang around without a controlling
parent make. And worse, a make can return to the shell (with an error)
while some sub-makes are still building stuff in some directories. If one
tries to carry on working after the top make has returned, one might see
confusing effects from those sub-makes (e.g. files magically appearing in
subdirs, command output in the terminal, and generally annoying things).
Killing all these sub-makes by hand can be cumbersome if there are many (I
have machines where I can build GCC with a parallelism of 32, and
something like the above happened to me. I chose to wait until the
sub-makes were done on their own, instead of hunting them down).
To demonstrate the above effect with sub-makes involved, just change the
top-level Makefile to:
----------------------------------
.PHONY: all fail1 fail2 fail3 ok1 ok2 ok3
all: fail1 ok1 fail2 ok2 fail3 ok3
ok1 ok2 ok3 fail1 fail2 fail3:
	$(MAKE) -C $@
----------------------------------
where each */Makefile contains the same commands as above, split
appropriately between the ok* and fail* subdirs. An example output would
look like:
% ./make/make/make -r -j6 ; pp sleep
/tmp/par-make/./make/make/make -C fail1
/tmp/par-make/./make/make/make -C ok1
/tmp/par-make/./make/make/make -C fail2
/tmp/par-make/./make/make/make -C ok2
/tmp/par-make/./make/make/make -C fail3
/tmp/par-make/./make/make/make -C ok3
make[1]: Entering directory `/tmp/par-make/ok1'
make[1]: Entering directory `/tmp/par-make/fail2'
make[1]: Entering directory `/tmp/par-make/fail1'
make[1]: Entering directory `/tmp/par-make/ok2'
make[1]: Entering directory `/tmp/par-make/fail3'
make[1]: Entering directory `/tmp/par-make/ok3'
Fail /tmp/par-make/fail2
exit 1
Ok /tmp/par-make/ok3
Ok /tmp/par-make/ok1
Fail /tmp/par-make/fail3
exit 1
Fail /tmp/par-make/fail1
Ok /tmp/par-make/ok2
exit 1
make[1]: *** [all] Error 1
make[1]: Leaving directory `/tmp/par-make/fail2'
make: *** [fail2] Error 2
make: *** Waiting for unfinished jobs....
make[1]: *** [all] Error 1
make[1]: Leaving directory `/tmp/par-make/fail3'
make[1]: *** [all] Error 1
make[1]: Leaving directory `/tmp/par-make/fail1'
make: *** [fail3] Error 2
make: *** [fail1] Error 2
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
matz 9765 0.0 0.0 7120 740 pts/5 S 05:18 0:00 sleep 2
matz 9766 0.0 0.0 7120 740 pts/5 S 05:18 0:00 sleep 2
matz 9769 0.0 0.0 7120 740 pts/5 S 05:18 0:00 sleep 2
address@hidden % Ok /tmp/par-make/ok3 done
make[1]: Leaving directory `/tmp/par-make/ok3'
Ok /tmp/par-make/ok1 done
Ok /tmp/par-make/ok2 done
make[1]: Leaving directory `/tmp/par-make/ok1'
make[1]: Leaving directory `/tmp/par-make/ok2'
Note how the prompt is already there, followed by some output from the
sub-makes still working in ok[123]. I'll spare us the output of running
make with the -d option. What happens is that make suddenly exits,
although there are still job slots in use.
I know why this happens. The problem is the interaction between die() and
reap_children() when multiple failing jobs are in the queue and the user
does not use -k. Let's suppose there are five job slots in use (reflecting
all three failing and two ok jobs). The first failing one will trigger
"reap_children(0, 0)" at some point, and then the chain of events goes
like so:
reap_children (0, /*err= */ 0)
# reap the failing child fail1
# if (!err && child_failed && !keep_going_flag)
# die (2);
die (2)
# this is the first call, hence dying is 0, ergo it does:
# dying = 1
# for (err = (status != 0); job_slots_used > 0; err = 0)
# reap_children (1, err);
# status == 2, hence err will be 1 in the first call
reap_children (1, 1)
# suppose this will get the second failing job, fail2
# if (!err && child_failed && !keep_going_flag)
# die (2);
# as err == 1, this will not call die(2). Instead it sets block = 0,
# repeats the loop, and exits it, as no other children are dead,
# so we return to the above die (2) activation
# We are in this loop again:
# for (err = (status != 0); job_slots_used > 0; err = 0)
# reap_children (1, err);
# right now job_slots_used is 3 (the last fail job, and the two ok jobs)
# this time, the second iteration, i.e. err is now 0, so we do:
reap_children (1, 0)
# We now reap the third failing child, fail3
# err is 0, hence we do this:
# if (!err && child_failed && !keep_going_flag)
# die (2);
die (2)
# as dying is set, we jump over the cleanup
# and just do:
exit (2)
Voila. We don't wait for the last two jobs, ok1 and ok2. Note that the
timing here is critical. If both remaining fail jobs are already done
when the second reap_children invocation runs, then that activation reaps
them both, and hence they don't lead to a recursive die() call in the
last reap_children() invocation.
The problem is that the 'err' variable is used to control two things,
namely whether the 'Waiting for unfinished jobs....' warning should be
printed, _and_ whether die() may be called recursively. As the warning
should be printed only once, 'err' is reset after the first iteration. But
that leads to a recursive invocation of die(), which just exits the whole
make and so never completes the waiting loop in the outer die()
activation.
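
The chain of events can be replayed with a small toy model. This is only
a sketch, not GNU make code: the struct, the enum, every model_* name and
the keep_err switch are made up for illustration, and the model assumes
the critical timing described above, where each reap_children() call
reaps exactly one child. With keep_err == 0 the loop resets err after the
first pass, as the current die() does; with keep_err == 1 err keeps its
initial value.

```c
#include <assert.h>

/* Toy model only; none of these names exist in GNU make. */
enum child_kind { CHILD_FAIL, CHILD_OK };

struct model {
    const enum child_kind *order; /* order in which children become reapable */
    int n, next;                  /* number of children, next one to reap */
    int job_slots_used;
    int dying;                    /* mirrors the 'dying' flag in die() */
    int exited;                   /* set once the modelled exit() is reached */
    int slots_at_exit;            /* job slots still in use at that point */
};

static void model_die(struct model *m, int status, int keep_err);

/* Assumption: each call reaps exactly one child (the critical timing). */
static void model_reap_children(struct model *m, int err, int keep_err)
{
    if (m->exited || m->next >= m->n)
        return;
    enum child_kind kind = m->order[m->next++];
    assert(m->job_slots_used > 0);
    m->job_slots_used--;
    /* Models: if (!err && child_failed && !keep_going_flag) die (2); */
    if (!err && kind == CHILD_FAIL)
        model_die(m, 2, keep_err);
}

static void model_die(struct model *m, int status, int keep_err)
{
    if (!m->dying) {
        m->dying = 1;
        int err = (status != 0);
        while (m->job_slots_used > 0 && !m->exited) {
            model_reap_children(m, err, keep_err);
            if (!keep_err)
                err = 0;          /* the "err = 0" step of the for loop */
        }
    }
    if (!m->exited) {             /* stands in for exit (status) */
        m->exited = 1;
        m->slots_at_exit = m->job_slots_used;
    }
}

/* Returns how many job slots are still in use when the model "exits". */
int slots_left_at_exit(const enum child_kind *order, int n, int keep_err)
{
    struct model m = { order, n, 0, n, 0, 0, 0 };
    while (!m.exited && m.next < m.n)
        model_reap_children(&m, 0, keep_err);  /* top-level reap_children (0, 0) */
    if (!m.exited)
        m.slots_at_exit = m.job_slots_used;
    return m.slots_at_exit;
}
```

For the order fail1, fail2, fail3, ok1, ok2 the model exits with two
slots still in use when keep_err is 0, and with none when keep_err is 1.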
I used the patch below to fix this problem. It produces no regressions in
the testsuite. It might also be a good idea to test that job_slots_used
is 0 right before doing the exit() in die(). That would have caught this
bug.
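
Such a check could look roughly like the following. This is a minimal
sketch, not part of the patch: check_no_jobs_left() is a hypothetical
helper, and the counter is passed in as a parameter so the snippet stands
alone. In make itself one would presumably test job_slots_used directly
just before the exit() call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not an existing GNU make function.
   Returns 1 if no job slots are in use, so exiting is safe;
   otherwise reports the problem on stderr and returns 0. */
int check_no_jobs_left(unsigned int job_slots_used)
{
    if (job_slots_used != 0) {
        fprintf(stderr,
                "internal error: exiting with %u job slot(s) still in use\n",
                job_slots_used);
        return 0;
    }
    return 1;
}
```

A caller could either wrap this in assert() for debug builds or keep
reaping children until it returns 1.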
I hope this makes sense.
Ciao,
Michael.
--
Index: job.c
===================================================================
RCS file: /cvsroot/make/make/job.c,v
retrieving revision 1.166
diff -u -p -r1.166 job.c
--- job.c 26 Jun 2005 03:31:30 -0000 1.166
+++ job.c 31 Jul 2005 03:50:43 -0000
@@ -475,9 +475,12 @@ reap_children (int block, int err)
if (err && block)
{
+ static int printed = 0;
/* We might block for a while, so let the user know why. */
fflush (stdout);
- error (NILF, _("*** Waiting for unfinished jobs...."));
+ if (!printed)
+ error (NILF, _("*** Waiting for unfinished jobs...."));
+ printed = 1;
}
/* We have one less dead child to reap. As noted in
Index: main.c
===================================================================
RCS file: /cvsroot/make/make/main.c,v
retrieving revision 1.210
diff -u -p -r1.210 main.c
--- main.c 12 Jul 2005 04:35:13 -0000 1.210
+++ main.c 31 Jul 2005 03:50:44 -0000
@@ -2990,7 +2990,7 @@ die (int status)
print_version ();
/* Wait for children to die. */
- for (err = (status != 0); job_slots_used > 0; err = 0)
+ for (err = (status != 0); job_slots_used > 0;)
reap_children (1, err);
/* Let the remote job module clean up its state. */