
Re: GNU parallel - resumable jobs


From: Ole Tange
Subject: Re: GNU parallel - resumable jobs
Date: Fri, 16 Dec 2011 12:45:02 +0100

On Fri, Dec 16, 2011 at 9:01 AM, rambach <rambachr@yahoo.com> wrote:
> On 12/15/2011 11:35 PM, Ole Tange wrote:
>>> On Wed, Dec 14, 2011 at 2:35 PM, rambach <rambachr@yahoo.com> wrote:
>>> On 12/12/2011 11:07 PM, Ole Tange wrote:

>>>> I am thinking of re-using --joblog and adding --resume. I see at least
>>>> 2 approaches:
>>>>
>>>> * Only look for the job-number.
:
> i typically have around 10 - 20 million jobs (each one taking about 0.3 -
> 1.5 sec), so depending on the length of the command line the log can easily
> reach 2 GB, which is way bigger than what fits into the machine's RAM, i.e.
> the log has to be read line-by-line, not in a File::Slurp fashion...
>
> i currently run my jobs on a handful of old Pentium III - IV's with 128 -
> 512 MB RAM each.
> i plan to replace them with a dozen Raspberry Pis as soon as they're
> available.
> that way i can cut down my power costs by a big margin.
> but those embedded ARM boxes also only sport 128/256 MB RAM.

Good to know. I would normally use a hash lookup, because hashes are fast
in Perl (even with millions of keys - I can do 2 million hash accesses/sec
in a hash with 10 million entries) and easy. But in this case we can use
something more memory efficient.

E.g. a bit vector, so each job run takes up 1 bit of memory. Something similar to:

# LOG is assumed to be an open filehandle on the --joblog file; the first
# column of each joblog line is the job's sequence number.
while(<LOG>) {
    /^(\d+)/ || die;
    # Inlining vec() here is 30% faster than calling set_job_already_run($1);
    vec($Global::job_already_run,$1,1) = 1;
}

# Before starting the job with sequence number $seq:
if(is_job_already_run($seq)) {
    # Skip
} else {
    # Run it
}

# Mark the job with sequence number $seq as run (sets a single bit).
sub set_job_already_run {
    my $seq = shift;
    vec($Global::job_already_run,$seq,1) = 1;
}

# True if the bit for sequence number $seq has been set.
sub is_job_already_run {
    my $seq = shift;
    return vec($Global::job_already_run,$seq,1);
}

For 20 million jobs that will cost you about 2.5 MB of memory (one bit per
sequence number).
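
A quick standalone sketch to check that figure - vec() grows the scalar to
int(seq/8)+1 bytes, so the cost is one bit per sequence number:

my $bits = "";
vec($bits, 20_000_000, 1) = 1;
# Prints roughly "2500001 bytes (~2.5 MB)"
printf "%d bytes (~%.1f MB)\n", length($bits), length($bits)/1_000_000;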

>>> so for this specific use case it is advantageous if only the seq number
>>> of the last completed job is saved in a special file.
>>
>> That will not work in the general case: Assume jobs 1-1000 each take 1
>> second, except for job 3, which takes 10000 seconds.
>>
>> In that situation it is not enough just to save the last completed
>> job. You need a way to say that jobs 1-2 and 4-1000 have finished.
>>
> alright, but it would basically suffice to save only the numbers of the
> jobs that haven't been run up to a certain point.
> so using your example it could look like:
> last job finished: 1000
> missing up to this point: 3
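
A minimal sketch of reading such a summary back (the file name resume.state
and the exact line format are assumptions for illustration only, not
something GNU Parallel provides):

# Hypothetical summary file with two lines, e.g.:
#   last job finished: 1000
#   missing up to this point: 3
open(RESUME, "<", "resume.state") || die;
my $last_finished = 0;
my %missing;
while(<RESUME>) {
    if(/^last job finished:\s*(\d+)/) { $last_finished = $1 }
    if(/^missing up to this point:\s*(.*)/) { $missing{$_} = 1 for $1 =~ /\d+/g }
}
close RESUME;

# A job must run if its seq is newer than the last finished one or listed as missing.
sub must_run {
    my $seq = shift;
    return $seq > $last_finished || $missing{$seq};
}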

I am not too fond of including yet another status file that is
basically just a summary of --joblog.

>>>> * Skip commands that are already run. Here parallel will instead look
>>>> at the command actually run (the last column in --joblog) and skip
>>>> commands that are already in the joblog.
>>>
>>> indeed, this solution is superior if the input data can change; but it
>>> requires that the log be parsed before each job.
>>
>> It would be parsed when you start GNU Parallel (i.e. once). After that
>> it will be a simple lookup in a hash table (which is fast).
>
> fast, but huge. imagine a hashtable with entries for 20 million jobs
> (strings).
> the hashtable would be even bigger than the logfile and could easily exhaust
> all RAM even on a big server, and possibly the entire address space on a
> 32-bit host.
>
> this solution doesn't scale very well.
>
> depending on the number of buckets the Perl hashtable implementation uses,
> even a hashtable lookup could become a bottleneck on this amount of data:
> with 1000 buckets, each bucket would contain 20,000 elements which must be
> compared against the current job string.

Do not underestimate the power of hashes in Perl - they are extremely
well implemented. But even so, your memory concern makes me more
confident that the right choice is to simply look for the job number.
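
For comparison, a minimal sketch of the command-keyed lookup discussed above,
assuming LOG is an open handle on the joblog and the command is the last
tab-separated field of each line; it is the per-command string storage that
makes this approach heavy:

my %command_already_run;
while(<LOG>) {
    chomp;
    # Simplification: commands containing tabs would need more careful parsing.
    my @field = split /\t/;
    $command_already_run{$field[-1]} = 1;
}
# Before starting a job:
# skip it if $command_already_run{$command} is set.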


/Ole


