Re: GNU Parallel Bug Reports Job failure semantics

bug-parallel

From:	Ole Tange
Subject:	Re: GNU Parallel Bug Reports Job failure semantics
Date:	Tue, 13 Sep 2011 12:43:05 +0200

On Tue, Sep 13, 2011 at 1:45 AM, Alastair Andrew
<address@hidden> wrote:
> Hi,
>
> I've been using GNU parallel for a while now to distribute jobs across about 
> 30 machines in my department's lab. I've set up the .parallel/sshloginfile to 
> list all the machines and their sizes so I can just use the -S .. flag to 
> spread the load. I tend to use the strictest error handling but I find it a 
> bit overzealous especially with regard to ssh failures. Often one or two 
> machines will be offline for maintenance (or undergrads will have unwittingly 
> switched them off); when GNU parallel tries to log in to one of these machines 
> it won't be able to and flags this up as a failure (thus terminating all my 
> jobs).

This is the situation --retries is made for. You should set --retries
to (number of offline computers +1).
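As a rough illustration of that rule of thumb: the helper name retries_for_offline and the job name ./myjob below are made up for this sketch and are not part of GNU parallel. It assumes that a failed job is retried on a host where it has not yet failed, so a job can hit every offline host once and still have one attempt left for a working host.

```shell
#!/bin/sh
# retries_for_offline: hypothetical helper, not part of GNU parallel.
# Prints the --retries value suggested above: offline hosts + 1.
retries_for_offline() {
    echo $(( $1 + 1 ))
}

# Usage sketch with 2 hosts expected to be offline (./myjob is a placeholder):
#   parallel --retries "$(retries_for_offline 2)" -S .. ./myjob ::: input1 input2
```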

> Currently I see two options: keep the .parallel/sshloginfile synced with the 
> currently accessible machines, or choose a large enough retry limit that this 
> problem won't be encountered as parallel tries to compensate. Neither seems 
> perfect. I think it would be better if GNU parallel didn't regard its ssh 
> failure as an error. After all it's not the user's task that has failed; the 
> job hasn't started, GNU parallel failed to distribute it. This would allow 
> users to specify a static pool of machines without worrying too much whether 
> a few were down. Obviously in a worst case scenario maybe the majority of 
> machines are unreachable so only a few are actually doing all the work. In 
> that case maybe there should be a threshold where parallel informs the user.

That is an interesting idea.

But currently I do not see a way of implementing it: how would you tell
the difference between ssh failing and ssh running a command that
fails?

  ssh nonexisting.example.com echo foo

fails with error code 255 because ssh fails.

  ssh localhost exit 255

fails with error code 255 because ssh succeeds but the command fails
with error code 255.

How will you tell the difference?
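The ambiguity can be reproduced without any network access. The sketch below is only an illustration: describe_ssh_status is a made-up helper name, and it relies on the documented ssh behaviour of returning the remote command's exit status, or 255 if ssh itself hit an error.

```shell
#!/bin/sh
# describe_ssh_status: hypothetical helper, not part of ssh or GNU parallel.
# Maps the exit status of `ssh host command` to what a caller can conclude.
describe_ssh_status() {
    case "$1" in
        0)   echo "ok" ;;
        # Either ssh itself failed, or the remote command exited with 255.
        255) echo "ambiguous" ;;
        # Any other status was passed through from the remote command.
        *)   echo "command-failed" ;;
    esac
}

# A local stand-in for `ssh localhost exit 255`: the caller only sees 255.
status=0
sh -c 'exit 255' || status=$?
describe_ssh_status "$status"
```

From the caller's side there is nothing besides the number 255 to inspect, which is why the two cases cannot be separated by exit status alone.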

We could implement a probe to test which machines are up when
starting. That way you would only have problems with hosts that go
offline after starting:

  cat .parallel/pre_sshloginfile |
    parallel -j0 ssh {} echo {} > .parallel/sshloginfile

But this would take time and slow down the first connection.

Before I implement something like that it would be good if you could
try the above probe out for a while and see how well that works.
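A sequential variant of such a probe, written as plain shell functions, might look like the sketch below. The names probe_host and filter_reachable are hypothetical, the 5 second timeout is an arbitrary choice, and the input is assumed to hold plain host names, one per line (entries with a job-slot prefix such as 8/host would need extra handling). It checks hosts one at a time, so it is slower than running the checks in parallel.

```shell
#!/bin/sh
# probe_host and filter_reachable are hypothetical helper names.

probe_host() {
    # BatchMode=yes: fail instead of prompting for a password.
    # ConnectTimeout=5: give up on a silent host after 5 seconds (assumed value).
    # -n: keep ssh from swallowing the host list on stdin.
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

filter_reachable() {
    # Reads one host name per line from stdin and prints those that answer.
    while IFS= read -r host; do
        [ -n "$host" ] || continue      # skip empty lines
        if probe_host "$host"; then
            printf '%s\n' "$host"
        fi
    done
}

# Usage sketch (both file names are placeholders):
#   filter_reachable < pre_sshloginfile > sshloginfile
```

Keeping the check in its own function makes it easy to swap in a different test, for example one that also verifies the remote machine has the needed software.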

/Ole
