
Re: Parallelising grep


From: Ole Tange
Subject: Re: Parallelising grep
Date: Mon, 12 Aug 2013 08:31:24 +0200

On Mon, Aug 12, 2013 at 5:50 AM, Nathan S. Watson-Haigh
<nathan.haigh@acpfg.com.au> wrote:
> Hi Ole,
>
> The number of lines (reads) in reads.ids is ~9 million. The number of 
> alignment lines in the SAM/BAM file is ~372,281,262.

The only thing I would change is the block size of the first command:

$ cat read.ids | parallel --round-robin --pipe --block 100k cat ">"id.{#}
$ parallel "samtools view in.bam | fgrep -w -f {}" ::: id.* > alignments.txt

But IIRC 'samtools view' is quite expensive. So doing that for each
fgrep feels like a waste of CPU power. A more efficient way would be
something like this:

# Create id chunk and fifo per CPU
$ cat read.ids | parallel --round-robin --pipe --block 100k "mkfifo fifo.{#}; cat > id.{#}"
# unpack the bam file into all fifos
$ samtools view in.bam | tee fifo.*  >/dev/null &
$ parallel -j0 --xapply fgrep -w -f {1} {2} ::: id.* ::: fifo.* > alignments.txt
# cleanup
$ rm fifo.* id.*
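
The fan-out through FIFOs can be tried without samtools or GNU Parallel. The following is only a minimal sketch under stated assumptions: records.txt is a made-up stand-in for the output of 'samtools view in.bam', the two id.N files are hand-written instead of being produced by --round-robin --pipe, and the fgrep jobs are started as plain background jobs (using grep -F, the modern spelling of fgrep).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so the cleanup globs cannot hit real files.
work=$(mktemp -d)
cd "$work"

# Hypothetical stand-in for `samtools view in.bam`: read ID in column 1.
printf 'r1\tA\nr2\tC\nr3\tG\nr4\tT\n' > records.txt

# Hypothetical stand-ins for the id.N chunks.
printf 'r1\n' > id.1
printf 'r3\nr4\n' > id.2

# One FIFO per id chunk.
mkfifo fifo.1 fifo.2

# One reader per FIFO, all running at the same time.
grep -F -w -f id.1 fifo.1 > out.1 &
grep -F -w -f id.2 fifo.2 > out.2 &

# A single pass over the data is copied into every FIFO.
tee fifo.1 fifo.2 < records.txt > /dev/null

# Wait for the readers, then collect their output.
wait
cat out.1 out.2 > alignments.txt

# cleanup
rm fifo.* id.*
cat alignments.txt
```

Each grep sees the complete stream but keeps only the lines matching its own chunk of IDs, so the expensive producer runs once no matter how many chunks there are.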

I am contemplating implementing this as a general function that would
pass the same data on stdin to each of the programs. So the above
would be:

$ cat read.ids | parallel --round-robin --pipe --block 100k "cat > id.{#}"
$ samtools view in.bam | parallel --tee-pipe fgrep -w -f {} ::: id.* > alignments.txt
$ rm id.*

It would have the limitation that it would have to run all the id.* in
parallel (-j0) as GNU Parallel would otherwise have to cache the full
output from samtools.
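
The same constraint can be observed in the FIFO version. The sketch below is an illustration only, assuming GNU coreutils 'timeout' is available; the file names and the 2 second limit are made up. Two FIFOs are created but only one reader is started, so tee blocks while opening the FIFO that has no reader and no data reaches the reader that is running.

```shell
#!/usr/bin/env bash
set -uo pipefail

demo=$(mktemp -d)
printf 'r1\nr2\n' > "$demo/records.txt"
mkfifo "$demo/fifo.1" "$demo/fifo.2"

# Only ONE of the two readers is started; it counts the lines it receives.
grep -c '' "$demo/fifo.1" > "$demo/count.1" &

# tee opens all of its output files before copying any data. Opening
# fifo.2 for writing blocks because nothing reads from it, so timeout
# has to kill tee (timeout exits with status 124 when it does so).
status=0
timeout 2 tee "$demo/fifo.1" "$demo/fifo.2" < "$demo/records.txt" > /dev/null || status=$?

wait
echo "tee exit status: $status"
echo "lines seen by the running reader: $(cat "$demo/count.1")"
rm -r "$demo"
```

With every reader running at once the data can stream straight through; with fewer readers than outputs, something would have to hold on to the data for the jobs that have not started yet.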

I am not really sure if it is generally useful, but if you can come up
with other situations where --tee-pipe would be useful I might look
into it.


/Ole


