Re: GNU parallel - resumable jobs


From: rambach
Subject: Re: GNU parallel - resumable jobs
Date: Wed, 14 Dec 2011 14:35:46 +0100
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101207 Thunderbird/3.1.7

On 12/12/2011 11:07 PM, Ole Tange wrote:
On Mon, Dec 12, 2011 at 4:42 PM, rambach <rambachr@yahoo.com> wrote:
Hello,

I often have long-running jobs which take several days.
Sometimes I have to interrupt those jobs and resume them later (e.g. for a reboot).

In the past I wrote custom wrappers for each of these jobs, but I am now
looking into using GNU Parallel instead.
My idea was to use an intermediate job-flow program which would save the
number of the line currently being processed, as in

jobflow.pl:

#!/usr/bin/env perl

use strict;
use warnings;
use File::Slurp;
use IO::Handle;

# flush every line immediately instead of block-buffering
STDOUT->autoflush(1);

my $skip = 0;
my $statefile = undef;

while(@ARGV) {
    my $arg = shift;
    if($arg eq "--skip") {
        $skip = shift;
        die "--skip needs a number" unless defined $skip;
    } elsif($arg eq "--statefile") {
        $statefile = shift;
        die "--statefile needs a filename" unless defined $statefile;
    } elsif($arg eq "--resume") {
        die "--resume needs --statefile set" unless defined $statefile;
        $skip = read_file $statefile if (-e $statefile);
    }
}

my $n = 0;

while(<>) {
    if($skip) {
        # already handed out in a previous run
        $skip--;
    } else {
        print;
        # record the count of lines handed out so far ($n is 0-based),
        # so a resumed run continues with the next unseen line
        write_file $statefile, $n + 1 if defined $statefile;
    }
    $n++;
}


print-and-sleep.sh:
#!/bin/sh
echo "$1"
sleep 1

seq 100 | ./jobflow.pl --statefile /tmp/jobstate --resume | parallel ./print-and-sleep.sh


However, this does not work as expected because of the buffering of the pipe:
after hitting CTRL-C once the first 10 lines have been processed, the statefile
already contains "100", so another invocation of the same command line does
not continue with line 11.
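
A minimal demonstration of the read-ahead, assuming jobflow.pl records the count of lines handed out so far:

# run the pipeline with a deliberately slow consumer in the background
seq 100 | ./jobflow.pl --statefile /tmp/jobstate --resume | (sleep 10; cat) &
sleep 1
# the statefile already reads 100 although nothing has been consumed yet:
# all 100 short lines fit in the kernel pipe buffer, so every print in
# jobflow.pl succeeds long before the corresponding job could run
cat /tmp/jobstate
wait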

From what I can tell, this functionality has to be built into the program
which actually launches the worker processes, in this case GNU Parallel.

I'd be interested to hear your opinion on this matter, and how one could
approach it most elegantly.

best regards,
roland rambach
So you basically have a long list of jobs (either in a file or
generated by parallel itself):

job1
job2
:
job_k
:
job_n

job1-job_k finish, and job_k+1 and some more jobs are running when
the machine suddenly reboots. After the reboot you would like an easy
way to skip job1-job_k and immediately start job_k+1.

Indeed.

I am thinking of re-using --joblog and adding --resume. I see at least
2 approaches:

* Only look for the job-number. Here parallel will simply look for the
seq (first column in --joblog) and skip the jobs that have a seq that
is already in the joblog. It cannot just look for the max job number,
as some early jobs may take longer than some later jobs, so the seq
column is not guaranteed to be sorted. A good thing about this is that
after the joblist is finished the joblog will look as if it was run in
one session.
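
A rough sketch of this seq-based skip as an external shell filter (not Parallel's internals; it assumes the tab-separated --joblog format with a header line and the seq in the first column, plus a hypothetical jobs.txt holding one complete command per line):

# pass 1: collect every seq already present in the joblog;
# pass 2: print only the job lines whose line number is not among them
awk -F'\t' 'NR==FNR { if (FNR > 1) done[$1] = 1; next }
            !(FNR in done)' /tmp/joblog jobs.txt | parallel

This skips completed jobs even when the seq column is not sorted, since membership rather than a maximum is tested.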

This approach is similar to what I had in mind.
However, it is possible that *millions* of jobs are being run. The logfile could then easily exceed 1 GB, and possibly even exceed the amount of data generated by the jobs themselves.


So for this specific use case it is advantageous if only the seq number of the last completed job is saved in a special file.


If instead the logfile were "abused" for the resume capability, GNU Parallel would only have to read the last (few) line(s) of the log to get the sequence number of the last completed job, and could then open the log in append mode for further output, in order to keep the memory footprint minimal.

Would it make sense to force --group and/or --keep-order? I'm thinking about a situation where job k+1 has finished but job k has not, so GNU Parallel would resume after k+1 and "forget" about job k.
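
A sketch of that last-line recovery (assuming --keep-order so the log is appended in seq order, at least one completed job in the log, and the hypothetical jobs.txt from above):

# seq of the last completed job = first column of the log's final line
last_seq=$(tail -n 1 /tmp/joblog | cut -f1)
# hand Parallel only the part of the job list that comes after it
tail -n +"$((last_seq + 1))" jobs.txt | parallel

Only one line of the log is ever read, so the footprint stays constant no matter how many millions of jobs have already completed.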


* Skip commands that are already run. Here parallel will instead look
at the command actually run (the last column in --joblog) and skip
commands that are already in the joblog.
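
A sketch of the command-based variant under the same assumptions, with the command taken as the last tab-separated column of the joblog:

# drop every line of jobs.txt that already appears verbatim as a
# logged command, regardless of its position in the list
awk -F'\t' 'NR==FNR { if (FNR > 1) done[$NF] = 1; next }
            !($0 in done)' /tmp/joblog jobs.txt | parallel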

If the input to parallel is exactly the same before and after the
reboot then there will be no difference between the two. But if the
input is changed (say you are running a command on all files in a dir
and now you have a few more files) then the last version is preferable,
as the file names may not be in the same order any more.

Indeed, this solution is superior if the input data can change, but it requires that the log be parsed before each job. If the log is big, this would waste precious RAM and could even lead to long processing times before each job can be started (potentially even longer than the time needed for the actual job to run).

Therefore I prefer the previous solution.

In my typical use cases, the input would stay the same on each invocation.

The last version will fail if the job list contains the same command with
the same arguments multiple times, as it will then skip the
duplicates.

Any other comments on this idea? Which version would you prefer?

[ ] Job number duplicate based
[ ] Command line duplicate based

/Ole

best regards,
roland rambach




