Re: possible issues with --retries option

bug-parallel

From:	Ole Tange
Subject:	Re: possible issues with --retries option
Date:	Sun, 19 Dec 2010 16:33:02 +0100

On Thu, Dec 16, 2010 at 2:21 AM, Nick Felt <address@hidden> wrote:
> Hello,
> I've been having some difficulty trying to get parallel to work consistently
> in distributing jobs across a bunch of remote machines on the local network.
>  My goal is to have parallel run a script remotely that exits with error
> code 1 if the machine is occupied, prompting parallel (with --retries set)
> to run the job on a different machine.

This is exactly the intended behaviour of GNU Parallel: If the job has
failed N times on computer A, but only N-1 times on computer B then
the jobs should be scheduled on computer B.
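
As a rough illustration of that rule (not GNU Parallel's actual code, which is written in Perl), the choice of computer can be sketched in shell. The function name pick_host and the "host:failures" argument format are invented for this example.

```sh
#!/bin/sh
# Hypothetical helper: pick the host on which a job has failed the fewest times.
# Arguments are "host:failures" pairs, e.g. "A:3 B:2".
pick_host() {
    best_host=""
    best_fail=""
    for pair in "$@"; do
        host=${pair%%:*}     # text before the first colon
        fails=${pair##*:}    # text after the last colon
        # Keep the first host seen with the lowest failure count.
        if [ -z "$best_fail" ] || [ "$fails" -lt "$best_fail" ]; then
            best_host=$host
            best_fail=$fails
        fi
    done
    printf '%s\n' "$best_host"
}

pick_host A:3 B:2   # prints B
```

With equal failure counts the sketch simply keeps the first host listed; how GNU Parallel breaks such ties is not stated here.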

> I'm not sure if the behavior I'm
> seeing is actually a bug, or intentional for reasons I don't understand.
>  But either way, advice would be appreciated!
> The main problem is that when I run this command (machine names are "nutmeg"
> and "vinegar" as remotes):
> $ seq 1 8 | parallel --retries 10 --sshlogin 8/nutmeg,8/vinegar -j+0
> "hostname; false"

This is a bug. -D shows GNU Parallel only schedules the job once on
computer A. And it clearly should try 10 times on both computers.

This fails, too:
seq 1 8 | parallel --retries 2 --sshlogin 8/iris,8/: -j+0 "hostname; false"
seq 1 8 | parallel --retries 2 --sshlogin 8/iris,8/: -j+1 "hostname; false"
seq 1 2 | parallel --retries 2 --sshlogin 8/iris,8/: -j-1 "hostname; false"
seq 1 1 | parallel --retries 2 --sshlogin 1/iris,1/: -j1 "hostname; false"
seq 1 1 | parallel --retries 2 --sshlogin 1/iris,1/: -j9 "hostname; false"
seq 1 1 | parallel --retries 2 --sshlogin 1/iris,1/: -j0 "hostname; false"
seq 1 1 | parallel --retries 2 --sshlogin 1/iris,1/: -j-1 "hostname; false"
seq 1 8 | parallel --retries 2 --sshlogin 1/iris,9/: -j-1 "hostname; false"

This works:
seq 1 8 | parallel --retries 2 --sshlogin 8/iris,8/: -j-1 "hostname; false"
seq 1 1 | parallel --retries 2 --sshlogin 1/iris,1/:  "hostname; false"
seq 1 4 | parallel --retries 2 --sshlogin 2/iris,2/: -j-1 "hostname; false"
seq 1 4 | parallel --retries 2 --sshlogin 2/iris,2/: -j1 "hostname; false"
seq 1 4 | parallel --retries 2 --sshlogin 1/iris,1/: -j1 "hostname; false"
seq 1 2 | parallel --retries 2 --sshlogin 1/iris,1/: -j1 "hostname; false"

This leads me to think the bug only happens if there are fewer jobs
than there are job slots on the second computer.
It is caused by processes_available_by_system_limit returning 0
because there are not any jobs left on stdin. This is correct unless
--retries is set, in which case we should count from 0 again.

If --retries = 1 it works as expected.
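
The slot calculation described above can be modelled in a few lines of shell. This is only one possible reading of the explanation, not the real fix: the function name slots_to_start and its three arguments are hypothetical, and the actual processes_available_by_system_limit in GNU Parallel is Perl code that does more than this.

```sh
#!/bin/sh
# Hypothetical model of the slot calculation; all names are invented.
# $1 = job slots on the host
# $2 = jobs still unread on stdin
# $3 = failed jobs waiting for a retry
slots_to_start() {
    slots=$1
    pending=$2
    retry=$3
    # Counting only $pending would reproduce the bug: once stdin is
    # exhausted the result is 0 even though jobs are waiting for a retry.
    wanted=$((pending + retry))
    if [ "$wanted" -lt "$slots" ]; then
        echo "$wanted"
    else
        echo "$slots"
    fi
}

slots_to_start 8 0 8   # stdin is empty, 8 jobs to retry: prints 8
```

Without the retry term the same call would yield 0, which matches the symptom of the second computer never being used.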

> I get no output.  This kind of makes sense, because parallel presumably
> retries each of the 8 jobs 10 times, always encounters an error, and gives
> up (albeit silently).

I believe it would make sense to give the output when GNU Parallel
finally gives up.

> That also makes sense.  What I'm having trouble understanding is why two
> other things also make it work: (c) removing the '-j+0' setting, and (d) -
> most perplexingly - changing the input to be 9 lines (or equivalently,
> reducing the 'ncpu' value for vinegar to 7):
> $ seq 1 8 | parallel --retries 10 --sshlogin 8/nutmeg,8/vinegar "hostname;
> false"

As this is working as expected I am not going bug hunting here. At
most I will see this as a sanity check.

> $ seq 1 9 | parallel --retries 10 --sshlogin 8/nutmeg,8/vinegar -j+0
> "hostname; false"

As this is working as expected I am not going bug hunting here. At
most I will see this as a sanity check.

> Furthermore, when I increase the input to 16 lines, I get an even mix of
> "nutmeg" and "vinegar" (9 lines always seems to produce "nutmeg" only) and

9 is caused by the default value for -j (see man page).

> it also seems to print out faster:

And that makes sense, too: GNU Parallel will schedule 8 tasks on
nutmeg and 8 tasks on vinegar. When they fail they will swap computers.
If you only scheduled 8 tasks they would all go onto one of the
computers. Then they have to fail before moving to the other. So the
total time for running 16 jobs should be comparable to running 8.

> $ seq 1 16 | parallel --retries 10 --sshlogin 8/nutmeg,8/vinegar -j+0
> "hostname; false"

This too works as expected.

> I don't know how the --retries option works internally, but I'd hazard a
> guess that it's somehow responsible for the variance I'm seeing.  Could
> someone explain what's going on here (and whether it's supposed to be working like
> this)?

I believe your first example shows a bug. Thank you for documenting it
so clearly. It made it fairly easy to hunt down. The fix is now in the
git version.

For machine occupation you may want to check out --load (available in
the git version) and give that a spin.
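
For comparison, the approach from the original question (a remote script that exits with 1 when the machine is occupied) might look roughly like the sketch below. The function names, the load threshold, and the use of /proc/loadavg (Linux only) are all assumptions; the actual script was not posted.

```sh
#!/bin/sh
# Hypothetical occupancy check; names and threshold are assumptions.

# is_busy LOAD MAX: succeeds (returns 0) when LOAD is greater than MAX.
is_busy() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load > max) }'
}

# run_if_idle MAX COMMAND...: returns 1 if the 1-minute load average
# is above MAX, otherwise runs COMMAND and returns its exit status.
run_if_idle() {
    max=$1
    shift
    # The first field of /proc/loadavg is the 1-minute load average (Linux).
    read -r load rest < /proc/loadavg
    if is_busy "$load" "$max"; then
        return 1
    fi
    "$@"
}
```

Run as the remote command, for example as run_if_idle 2.0 followed by the real job, a busy machine would then produce the non-zero exit status that --retries reacts to.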


/Ole
