GNU Parallel Bug Reports Job failure semantics
From: Alastair Andrew
Subject: GNU Parallel Bug Reports Job failure semantics
Date: Tue, 13 Sep 2011 00:45:03 +0100
Hi,
I've been using GNU parallel for a while now to distribute jobs across about 30
machines in my department's lab. I've set up the .parallel/sshloginfile to list
all the machines and their sizes so I can just use the -S .. flag to spread the
load. I tend to use the strictest error handling but I find it a bit
overzealous especially with regard to ssh failures. Often one or two machines
will be offline for maintenance (or undergrads will have unwittingly switched
them off); when GNU parallel tries to log in to one of these machines it can't,
and it flags this as a failure (thus terminating all my jobs).
Currently I see two options: keep the .parallel/sshloginfile synced with the
currently accessible machines, or choose a large enough retry limit that this
problem won't be encountered as parallel tries to compensate. Neither seems
perfect. I think it would be better if GNU parallel didn't regard its ssh
failure as an error. After all it's not the user's task that has failed; the
job hasn't started, GNU parallel failed to distribute it. This would allow
users to specify a static pool of machines without worrying too much whether a
few were down. Obviously, in the worst case the majority of machines might be
unreachable, leaving only a few to do all the work. In that case maybe there
should be a threshold below which parallel warns the user.
Anyway, I don't know what anyone else's opinion on the matter is; I just
thought it might simplify the process for users.
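
For reference, the first workaround mentioned above (keeping the login file in
sync with the machines that are actually up) could be approximated by probing
each host before a run. The following is only a rough sketch, not anything GNU
parallel provides: the function names are made up for illustration, and it
assumes every entry in the login file has the simple form [ncpus/][user@]host,
with ":" standing for the local machine.

```python
import subprocess


def host_of(entry):
    """Return the ssh target from an entry assumed to look like [ncpus/][user@]host."""
    return entry.split("/", 1)[-1]


def ssh_probe(target, timeout=5):
    """Return True if a non-interactive ssh login to target succeeds.

    BatchMode=yes stops ssh from prompting for a password, and
    ConnectTimeout bounds how long an offline machine can stall the check.
    """
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout}",
        target,
        "true",
    ]
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # ssh itself could not be started; treat the host as unreachable.
        return False
    return result.returncode == 0


def reachable_logins(lines, probe=ssh_probe):
    """Keep only the login entries whose host answers the probe.

    Blank lines and '#' comments are dropped; ':' (the local machine)
    is always kept because no ssh login is involved.
    """
    kept = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        if entry == ":" or probe(host_of(entry)):
            kept.append(entry)
    return kept
```

The filtered entries could then be written to a temporary file and passed to
parallel with --sshloginfile in place of the static list, leaving the original
.parallel/sshloginfile untouched. This only narrows the window: a machine can
still go down between the probe and the job being dispatched.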
Cheers,
Alastair
---------------------------------------------------------
Alastair Andrew,
address@hidden
Department of Computer and Information Sciences,
University of Strathclyde.
Tel: 0141 548 3138 Fax: 0141 548 4523
The University of Strathclyde is a charitable body, registered in Scotland,
with registration number SC015263.