bug-parallel

Re: GNU Parallel Bug Reports imperfect parallelization


From: FrithMartin
Subject: Re: GNU Parallel Bug Reports imperfect parallelization
Date: Mon, 29 Jun 2015 08:13:29 +0000

Hello,

using -N1 fixes the example that I gave: thank you.

But it doesn't fix my real problem. My example was too simple, sorry. Here's 
another example.
Let's make a fake genome with 12 big chromosomes and 100000 tiny chromosome 
fragments (which is not unusual):

seq 12000000 | awk 'NR % 1000000 == 1 {print ">"} {print "aaaaaaaaa"}' > fake-genome.fasta
seq 1000000 | awk 'NR % 10 == 1 {print ">"} {print "aaaaaaaaa"}' >> fake-genome.fasta

(This is still simplified, because the big chromosomes have identical sizes, 
and so do the small ones: in reality the sizes would vary.)
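
A quick way to check the layout of such a file is to tally record sizes. The sketch below is scaled down so it runs instantly (3 big records of 1000 lines and 20 tiny records of 10 lines); the counts and the helper names make_genome and record_sizes are illustrative assumptions, not part of the original commands.

```shell
#!/bin/sh
# Scaled-down fake genome: 3 "big" records of 1000 lines, then 20 "tiny"
# records of 10 lines. These sizes are illustrative only.
make_genome() {
    seq 3000 | awk 'NR % 1000 == 1 {print ">"} {print "aaaaaaaaa"}'
    seq 200  | awk 'NR % 10 == 1 {print ">"} {print "aaaaaaaaa"}'
}

# Print "count bytes" for each distinct record size, in order of first
# appearance. A record starts at a line beginning with ">".
record_sizes() {
    awk '
        function flush() {
            if (!(bytes in count)) order[++n] = bytes
            count[bytes]++
            bytes = 0
        }
        /^>/ && NR > 1 { flush() }
        { bytes += length($0) + 1 }   # +1 for the newline
        END {
            flush()
            for (i = 1; i <= n; i++) print count[order[i]], order[i]
        }
    '
}

make_genome | record_sizes
# prints:
# 3 10002
# 20 102
```

Run on the full-size file, the same tally should report the 10000002-byte and 102-byte records that show up in the wc output.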

This time, -N1 does not work well (I tried it), because it separates all the 
tiny chromosomes too. I would like the tiny chromosomes to be batched into 
reasonably-sized batches, and the big chromosomes to all be separated. The 
following command with my patch works quite well:

parallel --pipe --recstart '>' -k wc < fake-genome.fasta

1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1000001 1000001 10000002
1120748 1120748 11119656
 979242  979242 9080244
     11      11     102

Often, the run-time of parallel is negligible compared to the run-time of my 
analysis (which is usually not "wc"), so it's OK if parallel is a bit 
inefficient due to increasing block-sizes.

I'm afraid I don't understand the comments about blocksize fitting 2 or more 
records.
(In the example with "--block 20000000": without my patch it gives a single 
record to the first and last jobs, and with my patch it gives a single record 
to all jobs. I'm not sure what to conclude from that.)

Have a nice day,
Martin

________________________________________
From: address@hidden <address@hidden> on behalf of Ole Tange <address@hidden>
Sent: Saturday, June 27, 2015 12:30 AM
To: FrithMartin
Cc: address@hidden
Subject: Re: GNU Parallel Bug Reports imperfect parallelization

On Fri, Jun 26, 2015 at 8:50 AM, FrithMartin <address@hidden> wrote:

> I would like to use GNU parallel to analyze genome sequences, so that I can 
> analyze the chromosomes in parallel.

Sounds like good use of GNU Parallel.

> Let's make a fake genome, with 12 equal-sized chromosomes, in FASTA format:
>
> seq 12000000 | awk 'NR % 1000000 == 1 {print ">"} {print "aaaaaaaaa"}' > fake-genome.fasta

Reasonable.

> If I have >=12 CPUs, I should be able to get a 12-fold speedup, by analyzing 
> all the chromosomes in parallel.

... if and only if you manage to get one chromosome to run on each CPU.
So you need to tell GNU Parallel that this is what you want to do.

> Let's try:
>
> parallel --pipe --recstart '>' -k wc < fake-genome.fasta

Whoops: You forgot to tell GNU Parallel that you want to give a single
chromosome (record) to each CPU. You do that by using -N1.

> parallel: Warning: A record was longer than 1048576. Increasing to --blocksize 1363150.

GNU Parallel is telling you here that it might be doing something
that you did not intend.

> parallel: Warning: A record was longer than 1363150. Increasing to --blocksize 1772096.

It tries its best to mitigate the problem by increasing the block
size until a block is guaranteed to contain a full record.

> parallel: Warning: A record was longer than 1772096. Increasing to --blocksize 2303726.

But since a full record is much larger than the initial block size, it
has to exponentially increase the block size multiple times...

> parallel: Warning: A record was longer than 2303726. Increasing to --blocksize 2994845.
> parallel: Warning: A record was longer than 2994845. Increasing to --blocksize 3893300.
> parallel: Warning: A record was longer than 3893300. Increasing to --blocksize 5061291.
> parallel: Warning: A record was longer than 5061291. Increasing to --blocksize 6579680.
> parallel: Warning: A record was longer than 6579680. Increasing to --blocksize 8553585.
> parallel: Warning: A record was longer than 8553585. Increasing to --blocksize 11119662.

until it finally reaches a block size that holds a full record from
the position it started reading.
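
The sequence of block sizes in the warnings can be reproduced with a few lines of shell. The growth rule used here, ceil(1.3 * old + 1), is inferred from the numbers in the warnings and is an assumption, not something read from GNU Parallel's source; the function names grow and block_sizes are made up for this sketch.

```shell
#!/bin/sh
# Assumed growth rule (inferred from the warning messages, not from the
# GNU Parallel source): new = ceil(1.3 * old + 1), in integer arithmetic.
grow() {
    echo $(( ($1 * 13 + 9) / 10 + 1 ))
}

# Grow the block until it is larger than one record, so that a block read
# from the start of a record also contains the start of the next record.
# usage: block_sizes RECORD_BYTES INITIAL_BLOCK
block_sizes() {
    record=$1
    block=$2
    while [ "$block" -le "$record" ]; do
        block=$(grow "$block")
        echo "$block"
    done
}

block_sizes 10000002 1048576
```

Starting from the default 1048576 with a 10000002-byte record, this prints nine sizes, from 1363150 up to 11119662, matching the nine warnings.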

> 1000001 1000001 10000002
> 1000001 1000001 10000002
> 1000001 1000001 10000002
> 1000001 1000001 10000002
> 1000001 1000001 10000002
> 1000001 1000001 10000002
> 1000001 1000001 10000002
> 1000001 1000001 10000002
> 2000002 2000002 20000004
> 1000001 1000001 10000002
> 1000001 1000001 10000002
>
> It did not separate the 9th and 10th chromosomes, so I only get a 6-fold 
> speedup.

Yup: You did not tell it only to look for a single record, and you
gave it a --block-size that did not fit two or more records.

> The root cause is that it *adds* blocksize bytes to the partial record 
> already in memory. This means that the chunk size increases even when the 
> blocksize does not increase. To fix this, instead of reading blocksize bytes, 
> read (blocksize minus partial-record-size) bytes. I attach a patch that fixes 
> this.

But your patch does not fix the problem in general. In your case a
record is 10000002. So 20000000 will only fit a single record. But
running:

  parallel --block 20000000 --pipe --recstart '>' -k wc < fake-genome.fasta

will only give a single record to the first and the last job.
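
A toy model of the chunking makes the two behaviours easier to compare. It is only a sketch of the logic described in this thread, not GNU Parallel's code: it assumes 12 equal-sized records, a fixed block size larger than one record, and that each job receives every complete record currently in the buffer. The function name simulate and its mode names are made up for illustration.

```shell
#!/bin/sh
# Toy model of the chunking, NOT GNU Parallel's actual code.
# Assumes 12 equal-sized records and a block size larger than one record.
# usage: simulate MODE BLOCK   (MODE is "original" or "patched")
simulate() {
    mode=$1
    block=$2
    record=10000002              # bytes per chromosome in the fake genome
    stream=$((record * 12))      # bytes not yet read
    partial=0                    # bytes of the incomplete record kept in memory
    out=""
    while [ "$stream" -gt 0 ]; do
        if [ "$mode" = patched ]; then
            want=$((block - partial))   # top the buffer up to one block
        else
            want=$block                 # add a whole block to the partial record
        fi
        if [ "$want" -gt "$stream" ]; then
            want=$stream                # do not read past end of input
        fi
        stream=$((stream - want))
        buffer=$((partial + want))
        n=$((buffer / record))          # complete records go to one job
        partial=$((buffer - n * record))
        out="$out $n"
    done
    echo $out                           # records per job, in job order
}

simulate original 11119662   # prints: 1 1 1 1 1 1 1 1 2 1 1
simulate original 20000000   # prints: 1 2 2 2 2 2 1
simulate patched  20000000   # prints: 1 1 1 1 1 1 1 1 1 1 1 1
```

In this model the unpatched reader with --block 20000000 gives a single record only to the first and last jobs, while the patched reader gives one record to every job, because its buffer never exceeds 20000000 bytes and so never holds two complete records.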

In other words: You are using --block-size for something it is not
designed to do. If you get the warning, it is because you should
change the block size. In the next version I will put this in the
section on --block-size:

  For performance reasons size should be bigger than two records.

Actually it should be bigger than N+1 records, if you use -N.
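
One way to apply that rule is to measure the largest record and derive a block size from it. The sketch below is an assumption about how one might do that with awk; the helper names max_record_bytes and min_block_size are hypothetical, and it sizes the block by the largest record only.

```shell
#!/bin/sh
# Largest record, in bytes, of a FASTA stream on stdin.
# A record starts at a line beginning with ">".
max_record_bytes() {
    awk '
        /^>/ && NR > 1 { if (bytes > max) max = bytes; bytes = 0 }
        { bytes += length($0) + 1 }   # +1 for the newline
        END { if (bytes > max) max = bytes; print max + 0 }
    '
}

# Smallest block size strictly bigger than N+1 of the largest records.
# usage: min_block_size N < file.fasta
min_block_size() {
    n=$1
    max=$(max_record_bytes)
    echo $(( (n + 1) * max + 1 ))
}

# Tiny example: records of 7 and 5 bytes, N=1.
printf '>\naaaa\n>\naa\n' | min_block_size 1   # prints: 15
```

The result could then be passed as the value of --block together with -N1, for example via "$(min_block_size 1 < fake-genome.fasta)".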

> P.S. I also request to remove the increasing blocksize warnings, if the user 
> did not specify a blocksize, because they are harmless and just cause 
> needless concern.

Your error report tells me that this is not the case: when you get these
warnings, it is typically because you have misunderstood the purpose of
--block-size.

If you now feel you understand the purpose, feel free to help by
suggesting changes to the man page or the warning.


/Ole


