Re: parallel to gunzip and merge .gz files

From: Jay Hacker
Subject: Re: parallel to gunzip and merge .gz files
Date: Wed, 9 Nov 2011 16:47:16 -0500

I decided to time this several ways just for kicks.  I did this on a
16-processor Xeon X7350, using 96 very compressible text files of
about 95 MB compressed each (total 9 GB compressed, 236 GB
uncompressed), with the input in cache, reading and writing on
different 300+ MB/sec RAID arrays.  YMMV.

1. If you just need to treat your files as one big file for streaming
input to some other program, you can use process substitution:
other-prog <(zcat *.fastq.gz).  This is about the fastest and most
space-efficient you can hope for, but it may not work in your
0 sec

(1271 sec to run cat <(zcat *.fastq.gz) > /dev/null)
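
A small sketch of item 1 on made-up toy files, not the original data.
Process substitution is a bash/ksh/zsh feature and is not part of POSIX
sh, so the sketch calls bash explicitly. Here awk stands in for
other-prog, the file names are hypothetical, and gzip -dc is used as a
portable spelling of zcat.

```shell
# Toy stand-ins for the real *.fastq.gz inputs (hypothetical names and contents).
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$tmp/a_001.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp/a_002.fastq.gz"

# <(...) hands the consumer a file name (something like /dev/fd/63) that
# streams the decompressed data; nothing uncompressed is written to disk.
# awk plays the role of other-prog and just counts the lines it reads.
lines=$(bash -c 'awk "END { print NR }" <(gzip -dc "$1"/*.fastq.gz)' _ "$tmp")
echo "$lines"

rm -rf "$tmp"
```

The consumer has to be happy reading its input once, front to back;
a program that seeks around in its input file will not work on a pipe.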

2. Note that the concatenation of multiple gzip files is a valid gzip
file, so you may not need to unzip them.  Beware that there are some
programs that don't correctly unzip such files (I'm looking at you,
$ cat *.fastq.gz > output.fastq.gz
44 sec
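
To see item 2 in action on toy data: each gzip file is a self-contained
member, and gzip itself decompresses concatenated members as if they had
been one stream. The file names below are hypothetical.

```shell
tmp=$(mktemp -d)
printf 'first\n'  | gzip -c > "$tmp/one.gz"
printf 'second\n' | gzip -c > "$tmp/two.gz"

# Plain byte-level concatenation; nothing is decompressed or recompressed.
cat "$tmp/one.gz" "$tmp/two.gz" > "$tmp/joined.gz"

# gzip reads both members back to back and emits their contents in order.
joined=$(gzip -dc "$tmp/joined.gz")
printf '%s\n' "$joined"

# gzip -t tests the integrity of the whole concatenated file.
gzip -t "$tmp/joined.gz" && status=ok

rm -rf "$tmp"
```

Running the same check with whatever downstream tool will consume the
file is a cheap way to find out whether it handles multi-member input.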

3. If you do end up needing to recompress them, you could look into
pigz, the "parallel implementation of gzip."  Note that it puts out
the same kind of concatenated gzip files that some systems don't read
correctly.
$ zcat *.fastq.gz | pigz > output.fastq.gz
1320 sec

3.5. gzip files can't really be decompressed in parallel, but unpigz
tries its best:
$ unpigz -c *.fastq.gz | pigz > output.fastq.gz
1099 sec

4. If you're really stuck on parallel ;), you have lots of free memory
(size of largest uncompressed input * number of processes), and don't
care in what order the files are merged, something like this might
give you a small improvement:
$ TMPDIR=/dev/shm parallel zcat ::: *.fastq.gz | pigz > output.fastq.gz
961 sec
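
The memory estimate in item 4 can be worked through with the figures
from the top of this message. This is back-of-the-envelope only: it
assumes the 96 files are roughly equal in size (so the average stands in
for the largest) and that parallel runs one job per processor, i.e. 16
at a time. Neither assumption is stated above.

```shell
total_gb=236   # total uncompressed size quoted above
files=96       # number of input files quoted above
jobs=16        # assumption: one parallel job per processor

estimate=$(awk -v t="$total_gb" -v f="$files" -v j="$jobs" 'BEGIN {
    per_file = t / f    # average uncompressed size per file, in GB
    printf "per file: about %.1f GB\n", per_file
    printf "buffered at once: about %.0f GB\n", per_file * j
}')
echo "$estimate"
```

The reason memory matters here is that parallel buffers each job's
output under TMPDIR until it can be printed, and /dev/shm is backed by
RAM. If the merge order does matter, parallel's -k (--keep-order) option
prints the results in input order instead.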

And finally, the simplest and most compatible but slowest method:
$ zcat *.fastq.gz | gzip > output.fastq.gz
3259 sec
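
Whichever method is used, it may be worth checking that the merged file
decompresses to the same bytes as the inputs. A sketch on toy data with
hypothetical file names; it uses pigz when it is installed and falls
back to gzip otherwise.

```shell
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$tmp/s_001.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp/s_002.fastq.gz"

# Prefer pigz if present; plain gzip writes a compatible file, only slower.
if command -v pigz >/dev/null 2>&1; then compressor=pigz; else compressor=gzip; fi

# Checksum of the decompressed inputs, in glob (sorted) order.
before=$(gzip -dc "$tmp"/*.fastq.gz | cksum)

# merged.gz deliberately does not match *.fastq.gz, so a rerun cannot
# pick the output up as one of the inputs.
gzip -dc "$tmp"/*.fastq.gz | "$compressor" -c > "$tmp/merged.gz"
after=$(gzip -dc "$tmp/merged.gz" | cksum)

if [ "$before" = "$after" ]; then echo "merge OK"; else echo "merge MISMATCH"; fi
rm -rf "$tmp"
```

This check only makes sense for the order-preserving methods; with the
unordered parallel variant in item 4 the checksums will generally differ
even when no data is lost.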

Good hunting.

On Tue, Nov 8, 2011 at 12:07 AM, vijai2007 <address@hidden> wrote:
> Hello,
> I have about 99 files in the file name format:
> SRRA_ATCACG_L008_R1_001.fastq.gz to SRRA_ATCACG_L008_R1_099.fastq.gz
> I want to unzip and merge them to a single fastq file.
> These are from the Illumina CASAVA 1.8.1
> How do I do this in GNU parallel?
> Thanks
> vijai
