Re: Parallelising grep

From:	Ole Tange
Subject:	Re: Parallelising grep
Date:	Mon, 12 Aug 2013 08:31:24 +0200

On Mon, Aug 12, 2013 at 5:50 AM, Nathan S. Watson-Haigh
<nathan.haigh@acpfg.com.au> wrote:
> Hi Ole,
>
> The number of lines (reads) in reads.ids is ~9 million. The number of 
> alignment lines in the SAM/BAM file is ~372,281,262.

The only thing I would change is the block size of the first:

$ cat read.ids | parallel --round-robin --pipe --block 100k cat ">"id.{#}
$ parallel "samtools view in.bam | fgrep -w -f {}" ::: id.* > alignments.txt

But IIRC 'samtools view' is quite expensive. So doing that for each
fgrep feels like a waste of CPU power. A more efficient way would be
something like this:

# Create id chunk and fifo per CPU
$ cat read.ids | parallel --round-robin --pipe --block 100k "mkfifo fifo.{#}; cat > id.{#}"
# unpack the bam file into all fifos
$ samtools view in.bam | tee fifo.*  >/dev/null &
$ parallel -j0 --xapply fgrep -w -f {1} {2} ::: id.* ::: fifo.* > alignments.txt
# cleanup
$ rm fifo.* id.*
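
To make the fan-out easier to see, here is a rough sketch of the same idea with only mkfifo, tee and grep, no GNU Parallel and no samtools. The data, the two id chunks and all file names are made up for illustration; treat it as a toy, not as the real pipeline:

```shell
#!/bin/sh
# Toy fan-out: one producer, one fifo per id chunk, one grep per fifo.
# All names and data below are hypothetical stand-ins.
set -eu
dir=$(mktemp -d)

# Two id chunks (stand-ins for id.1 and id.2)
printf 'r1\nr3\n' > "$dir/id.1"
printf 'r4\n'     > "$dir/id.2"

# Stand-in for the output of 'samtools view in.bam'
printf 'r1\tchrA\nr2\tchrA\nr3\tchrB\nr4\tchrB\nr14\tchrC\n' > "$dir/input.sam"

mkfifo "$dir/fifo.1" "$dir/fifo.2"

# Start every reader first; each one blocks until the writer opens its fifo
grep -F -w -f "$dir/id.1" "$dir/fifo.1" > "$dir/out.1" &
grep -F -w -f "$dir/id.2" "$dir/fifo.2" > "$dir/out.2" &

# A single pass over the data, copied into every fifo
tee "$dir/fifo.1" "$dir/fifo.2" < "$dir/input.sam" > /dev/null

# Wait for both greps to finish, then collect their output
wait
result=$(cat "$dir/out.1" "$dir/out.2")
printf '%s\n' "$result"

rm -r "$dir"
```

The -w matters here: without it the id r1 would also match the r14 line.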

I am contemplating implementing this as a general function that would
pass the same data on stdin to each of the programs. So the above
would be:

$ cat read.ids | parallel --round-robin --pipe --block 100k "cat > id.{#}"
$ samtools view in.bam | parallel --tee-pipe fgrep -w -f {} ::: id.* > alignments.txt
$ rm id.*

The limitation is that all the id.* jobs would have to run in
parallel (-j0), as GNU Parallel would otherwise have to cache the full
output from samtools.
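
To illustrate that trade-off: if the jobs cannot all run at once, the stream has to be stored somewhere before the later jobs can read it. A toy sketch with plain shell tools, where the cache file, the id chunks and the data are all made-up stand-ins:

```shell
#!/bin/sh
# Toy sketch: cache the stream once so the id chunks can be searched one at a time.
# All names and data below are hypothetical stand-ins.
set -eu
dir=$(mktemp -d)

printf 'r1\nr3\n' > "$dir/id.1"
printf 'r4\n'     > "$dir/id.2"

# Stand-in for 'samtools view in.bam'; the whole output lands on disk
printf 'r1\tchrA\nr2\tchrA\nr3\tchrB\nr4\tchrB\n' > "$dir/cache.sam"

# The readers no longer have to run at the same time; here they run in sequence
cached_result=$(
  for ids in "$dir"/id.*; do
    grep -F -w -f "$ids" "$dir/cache.sam"
  done
)
printf '%s\n' "$cached_result"

rm -r "$dir"
```

The cost is the disk space for the full cached output, which is exactly what running everything with -j0 avoids.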

I am not really sure if it is generally useful, but if you can come up
with other situations where --tee-pipe would be useful I might look
into it.

/Ole
