Re: GNU parallel - resumable jobs


From: Ole Tange
Subject: Re: GNU parallel - resumable jobs
Date: Mon, 12 Dec 2011 23:07:37 +0100

On Mon, Dec 12, 2011 at 4:42 PM, rambach <rambachr@yahoo.com> wrote:
> Hello,
>
> i often have long running jobs, which take several days.
> sometimes i have to interrupt those jobs and resume them later (i.e. reboot)
>
> in the past i wrote custom wrappers for each of these jobs, but am now
> looking into using GNU Parallel instead.
> my idea was to use an intermediate job-flow program which would save the
> current line number processed, as in
>
> jobflow.pl:
>
> #!/usr/bin/env perl
>
> use strict;
> use warnings;
> use File::Slurp;
> use IO::Handle;
>
> STDOUT->autoflush(1);
>
> my $skip = 0;
> my $statefile = undef;
>
> while(@ARGV) {
>    my $arg = shift;
>    if($arg eq "--skip") {
>        $skip = shift;
>        die unless defined $skip;
>    } elsif($arg eq "--statefile") {
>        $statefile = shift;
>        die unless defined $statefile;
>    } elsif($arg eq "--resume") {
>        die "resume needs --statefile set" unless defined $statefile;
>        $skip = read_file $statefile if (-e $statefile);
>    }
> }
>
> my $n = 0;
>
> while(<>) {
>    if($skip) {
>        $skip--;
>    } else {
>        print;
>        write_file $statefile, $n if defined $statefile;
>    }
>    $n++;
> }
>
>
> print-and-sleep.sh:
> #!/bin/sh
> echo $1
> sleep 1
>
> seq 100 | ./jobflow.pl --statefile /tmp/jobstate --resume | parallel ./print-and-sleep.sh
>
>
> however this does not work as expected because of the buffering of the pipe.
> i.e. after hitting CTRL-C after the first 10 lines printed, and running the
> same line again, it would already save "100" to the statefile, thus not
> continuing with 11 after another invocation of the same command line.
>
> from what i can say, this functionality had to be built into the program
> which actually launches the worker process, in this case GNU Parallel.
>
> i'd be interested to hear your opinion on this matter; and how one could
> approach it in the most elegant way.
>
> best regards,
> roland rambach

So you basically have a long list of jobs (either in a file or
generated by parallel itself):

job1
job2
:
job_k
:
job_n

job1 through job_k have finished, and job_k+1 and some later jobs are
running, when the machine suddenly reboots. After the reboot you would
like an easy way to skip job1 through job_k and immediately start
job_k+1.

I am thinking of re-using --joblog and adding --resume. I see at least
2 approaches:

* Only look for the job number. Here parallel will simply look at the
seq (first column in --joblog) and skip the jobs whose seq is already
in the joblog. It cannot just look for the max job number: some early
jobs may take longer than some later jobs, so the seq column is not
guaranteed to be sorted. A good thing about this is that after the
job list is finished the joblog will look as if it was run in one
session.

* Skip commands that have already been run. Here parallel will instead
look at the command actually run (the last column in --joblog) and
skip commands that are already in the joblog.
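
To make the two approaches concrete, here is a rough sketch of each as
a stand-alone filter. It is not how parallel does (or will do) it. The
function names skip_by_seq and skip_by_command are made up, and the
sketch assumes a tab-separated joblog with one header line, seq in the
first column and the command in the last column, with no tabs inside
the command. It also assumes each line on stdin is a complete command.

```shell
#!/bin/sh
# Sketch only. Assumptions: tab-separated joblog, one header line,
# seq in the first column, command in the last column.

# Approach 1: skip input line N if seq N is already in the joblog.
# $1 = joblog file; the job list is read from stdin.
skip_by_seq() {
    awk -v joblog="$1" '
        BEGIN {
            # Load every logged seq; the log need not be sorted.
            while ((getline line < joblog) > 0) {
                n++
                if (n > 1) { split(line, f, "\t"); seen[f[1]] = 1 }
            }
        }
        !(NR in seen)   # print only lines whose number is not logged
    '
}

# Approach 2: skip input lines whose command is already in the joblog.
# $1 = joblog file; the job list is read from stdin.
skip_by_command() {
    awk -v joblog="$1" '
        BEGIN {
            while ((getline line < joblog) > 0) {
                n++
                k = split(line, f, "\t")
                if (n > 1 && k > 0) seen[f[k]] = 1
            }
        }
        !($0 in seen)   # print only commands that are not logged
    '
}
```

For example, "skip_by_command /tmp/joblog < commands.txt" would print
the commands that still need to run (both file names are placeholders).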

If the input to parallel is exactly the same before and after the
reboot then there will be no difference between the two. But if the
input has changed (say you are running a command on all files in a dir
and now you have a few more files) then the second approach is
preferable, as the file names may not be in the same order any more.
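
A small illustration of the changed-input case. Everything in it is
hypothetical: the file names, the gzip commands, and the simplified
two-column log (seq, command) that stands in for the real joblog.

```shell
#!/bin/sh
# Illustration only; the two-column log (seq, command) is a simplification.
log=$(mktemp)
# Before the reboot the dir held a.txt and c.txt, and both jobs finished.
printf '1\tgzip a.txt\n2\tgzip c.txt\n' > "$log"

# After the reboot b.txt has appeared, so c.txt moves from line 2 to line 3.
jobs_after='gzip a.txt
gzip b.txt
gzip c.txt'

# Job number based: skip line N if seq N is in the log.
by_seq=$(printf '%s\n' "$jobs_after" | awk -v joblog="$log" '
    BEGIN { while ((getline line < joblog) > 0) { split(line, f, "\t"); seen[f[1]] = 1 } }
    !(NR in seen)')

# Command based: skip a line if the same command is in the log.
by_cmd=$(printf '%s\n' "$jobs_after" | awk -v joblog="$log" '
    BEGIN { while ((getline line < joblog) > 0) { n = split(line, f, "\t"); seen[f[n]] = 1 } }
    !($0 in seen)')

echo "job number based runs: $by_seq"   # gzip c.txt (again), b.txt is missed
echo "command based runs:    $by_cmd"   # gzip b.txt
rm -f "$log"
```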

The second approach will fail if the job list contains the same command
with the same arguments multiple times, as it will then skip the
duplicates.
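
The duplicate problem can be seen in a few lines. This is again only an
illustration with made-up data: a job list of three identical commands,
of which the first is assumed to have finished before the interruption.

```shell
#!/bin/sh
# Illustration only: three identical jobs, the first one already finished.
finished='sleep 1'      # command of the one completed job (seq 1)
joblist='sleep 1
sleep 1
sleep 1'

# Command based: every line equal to a finished command is skipped.
left_by_cmd=$(printf '%s\n' "$joblist" |
    awk -v finished="$finished" '$0 != finished { n++ } END { print n + 0 }')

# Job number based: only line 1 (seq 1) is skipped.
left_by_seq=$(printf '%s\n' "$joblist" |
    awk 'NR != 1 { n++ } END { print n + 0 }')

echo "command based:    $left_by_cmd jobs left"   # 0, the duplicates are lost
echo "job number based: $left_by_seq jobs left"   # 2
```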

Any other comments on this idea? Which version would you prefer?

[ ] Job number duplicate based
[ ] Command line duplicate based

/Ole


