GNU Parallel Bug Reports Truncated large records

From: Johannes Dröge
Subject: GNU Parallel Bug Reports Truncated large records
Date: Mon, 23 Feb 2015 14:28:07 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

Hi Ole and GNU parallel devs,

I'm processing large files (~50 GiB) with variable record sizes and have the 
following issues:

1) The processing run-time of individual blocks grows faster than linearly with 
the input size. Therefore, it would be best if GNU parallel allowed passing a 
single record or a fixed number of records to each job, or at least did not 
automatically increase the block size. Instead, when it encounters large 
individual records, the block size auto-detection increases the block size 
until only very few processes run in parallel, and these then dominate the 
overall run-time. This behavior strongly reduces the granularity of the 
parallel execution.

2) Large records (>2 GiB) are being truncated at 2 GiB and are thus passed 
incompletely via stdin. You can find my compressed input under

 (~1.2 GiB, valid until March 2015)

and I'm processing the data as follows:

zcat debug.maf.gz | parallel --halt-on-error --no-notice --gnu --pipe \
  --recstart '# batch ' --recend '\n\n' 'cat > "$PARALLEL_SEQ".maf'

You will see that only one job and one output file are created because the 
first record is the largest one. The output is then truncated after exactly 
2 GiB. I think this is a serious issue because it is silent data corruption 
and will affect the analysis if, for instance, biological sequence data is 
shortened before analysis.
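
One way to check for this kind of truncation is to compute the size of every 
record in the input and compare it with the size of the corresponding output 
file. The following is only a minimal sketch: the function name record_sizes 
is made up for illustration, and it assumes that each record starts at a line 
beginning with '# batch ' (as in the --recstart value above) and that the 
input ends with a newline. A truncated output file would then show up as a 
file whose size is 2147483648 bytes (2 GiB) while the matching record is 
larger.

```shell
#!/bin/sh
# Print "<record number> <size in bytes>" for every record read from stdin.
# Assumption: a record starts at a line beginning with "# batch ".
record_sizes() {
    LC_ALL=C awk '
        /^# batch / { n++ }                        # a new record starts here
        n > 0       { size[n] += length($0) + 1 }  # +1 for the newline
        END         { for (i = 1; i <= n; i++) printf "%d %.0f\n", i, size[i] }
    '
}

# Hypothetical usage, to be compared with the sizes of the written files:
#   zcat debug.maf.gz | record_sizes
#   wc -c ./*.maf
```

LC_ALL=C makes awk count bytes rather than multi-byte characters, and "%.0f" 
avoids scientific notation for sizes above 2 GiB in awk implementations that 
would otherwise abbreviate large numbers.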

Info: I'm using the latest version of GNU parallel (20150122) on 64-bit Linux, 
Debian 7.

Thanks for your help.

Gruß Johannes

Johannes Dröge, M.Sc.
Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Universitätsstraße 1, 40225 Düsseldorf, Germany
PGP: http://keys.fungs.de/6ea5e4.asc (55F2720303A7F236A94666F20E2360727A6EA5E4)
Web: algbio.cs.uni-duesseldorf.de | Tel/Fax: +49 211 81-12644/13464

