
Re: GNU parallel - resumable jobs


From: Ole Tange
Subject: Re: GNU parallel - resumable jobs
Date: Thu, 15 Dec 2011 23:35:24 +0100

On Wed, Dec 14, 2011 at 2:35 PM, rambach <rambachr@yahoo.com> wrote:
> On 12/12/2011 11:07 PM, Ole Tange wrote:
>>
>> On Mon, Dec 12, 2011 at 4:42 PM, rambach <rambachr@yahoo.com> wrote:

>> I am thinking of re-using --joblog and adding --resume. I see at least
>> 2 approaches:
>>
>> * Only look at the job number. Here parallel will simply look for the
>> seq (first column in --joblog) and skip the jobs that have a seq that
>> is already in the joblog. It cannot just look for the max job number,
>> as some early jobs may take longer than some later jobs, so the seq
>> column is not guaranteed to be sorted. A good thing about this is that
>> after the job list is finished, the joblog will look as if it was run
>> in one session.
>
> This approach is similar to what I had in mind.
> However, it is possible that *millions* of jobs are being run.
> That would mean the logfile could easily exceed 1 GB and possibly even
> exceed the amount of data generated by the jobs themselves.

If you have 1 million jobs at 100 bytes each, that is 100 MB. That is
hardly scary. Reading 100 MB should also not be scary on modern
systems, as long as it is only read once.
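
As a rough sanity check, here is a throwaway Python snippet with just
that arithmetic (the real number of course depends on how long your
command lines are):

    jobs = 1000000          # completed jobs recorded in the joblog
    bytes_per_line = 100    # rough size of one joblog line
    print(jobs * bytes_per_line / 1e6, "MB")   # -> 100.0 MB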

> So for this specific use case it would be advantageous if only the seq
> number of the last completed job were saved in a special file.

That will not work in the general case: assume jobs 1-1000 take 1
second each, except for job 3, which takes 10000 seconds.

In that situation it is not enough to save just the last completed
job. You need a way to say that jobs 1-2 and 4-1000 have finished.
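
To sketch what that bookkeeping could look like, here is a hypothetical
Python version (GNU Parallel itself is Perl; this only illustrates the
idea, and assumes a tab-separated joblog with seq in the first column):

    def completed_seqs(joblog_path):
        """Read an existing --joblog once and collect the seq numbers
        (first column) of jobs that have already finished."""
        done = set()
        try:
            with open(joblog_path) as f:
                for line in f:
                    first = line.split("\t", 1)[0]
                    if first.isdigit():        # data lines start with a seq
                        done.add(int(first))
        except FileNotFoundError:
            pass                               # no joblog yet: nothing to skip
        return done

    # The scenario above: jobs 1-1000, everything done except the slow job 3.
    finished = set(range(1, 1001)) - {3}
    pending = [seq for seq in range(1, 1001) if seq not in finished]
    print(pending)        # [3] - only the unfinished job is rerun
    print(max(finished))  # 1000 - a "last completed seq" file would skip job 3

A set (hash) of completed seqs is enough; keeping only the highest seq
loses exactly the information about the stragglers.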

> Would it make sense to force --group and/or --keep-order? I'm thinking
> about a situation where job k+1 has finished, but not job k, so GNU
> Parallel would resume at k+1 and "forget" about job k.

In that case I would prefer the general solution of keeping the seq
numbers of all completed jobs.

>> * Skip commands that have already been run. Here parallel will instead
>> look at the command actually run (the last column in --joblog) and skip
>> commands that are already in the joblog.
>>
>> If the input to parallel is exactly the same before and after the
>> reboot then there will be no difference between the two. But if the
>> input is changed (say you are running a command on all files in a dir
>> and now you have a few more files) then the latter version is preferable,
>> as the file names may not be in the same order any more.
>
>
> Indeed, this solution is superior if the input data can change, but it
> requires that the log be parsed before each job.

It would be parsed when you start GNU Parallel (i.e. once). After that
it would be a simple lookup in a hash table (which is fast).
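
In pseudo-code, a hypothetical Python sketch of that cost model (not the
Perl implementation; it assumes the command in the last joblog column
contains no embedded tabs, and generate_commands/run are placeholders):

    def completed_commands(joblog_path):
        """Parse the --joblog once at startup; return the set of command
        strings (last column) that have already been run."""
        done = set()
        with open(joblog_path) as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if cols and cols[0].isdigit():   # data lines start with a seq
                    done.add(cols[-1])           # command is the last column
        return done

    # One-time parse, then each candidate command is a constant-time lookup:
    #   done = completed_commands("joblog")
    #   for cmd in generate_commands():
    #       if cmd not in done:
    #           run(cmd)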

> If the log were big, it would waste precious RAM, and could even lead to
> long processing times before each job could be started (potentially even
> longer than the time needed for the actual job to run).

I am a bit curious what kind of jobs you run. How long does each job
run for? How many are you normally running? How big are your machines?


/Ole


