RE: Parallel Merge


From: Nathan Watson-Haigh
Subject: RE: Parallel Merge
Date: Sat, 20 Aug 2011 08:24:51 +0930







-----Original Message-----


From: ole.tange@gmail.com on behalf of Ole Tange
Sent: Fri 19/08/2011 22:55
To: Nathan Watson-Haigh
Cc: parallel@gnu.org
Subject: Re: Parallel Merge

On Wed, Aug 17, 2011 at 1:13 PM, Nathan Watson-Haigh
<nathan.watson-haigh@awri.com.au> wrote:
>
> I just found this tool today and I'm very impressed with it!

Good to see a fellow bioinformatician finds it useful. If you like GNU Parallel:

* Post the intro videos on forums/blogs/Google+/Twitter/Facebook/Linkedin
* Request or write a review for your favourite blog or magazine
* Invite me as a speaker for your next conference

If GNU Parallel saves you money:

* (Have your company) donate to FSF https://my.fsf.org/donate/

> I've seen the example for parallelising sort.... Is there a way to spread the merge part among more cores? I'm trying to sort over 200 million lines - the merge part takes several minutes to complete.
>
> How do I write a script for this so I can use it as a drop-in replacement for a sort command used in the pipeline of another software tool?

That is an interesting idea. If your 200M lines are in a single file,
you can use 'parallel --pipe'. This will distribute the merge to all
cores; only the final merge will be done by a single core. If bigfile
is 1G you will in total read 3G and write 3G (which is clearly not
optimal):

cat bigfiles* | parallel --pipe --files sort \
  | parallel --files -X sort -m {} ';' rm {} \
  | parallel -Xj1 sort -m {} ';' rm {} > bigfile.sort

However, --pipe is somewhat slow (less than 100 MB/s) and may thus
become a bottleneck. So for an optimal solution we probably need to
treat several situations separately.

Number of files: nfiles (1 if reading from stdin (standard input))
Total size of files: sfiles
Memory available: RAM
Number of processes that UNIX allows to be run in parallel: npar
Number of processors: nproc
Number of filenames that fit on the command line: ncmd
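
For reference, these quantities can be estimated on a typical GNU/Linux
system roughly as follows (a sketch only; the ncmd value in particular
is just a crude estimate):

# Rough estimates of the quantities above (GNU/Linux):
nproc=$(nproc)                        # number of processors
npar=$(ulimit -u)                     # max processes the user may run
ncmd=$(( $(getconf ARG_MAX) / 256 ))  # crude guess at filenames per command line
RAM=$(awk '/MemTotal/ {print $2*1024}' /proc/meminfo)  # RAM in bytes
sfiles=$(du -cb "$@" | awk 'END {print $1}')           # total size of the input files
nfiles=$#                             # number of files (when passed as arguments)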

If sfiles < RAM then we should read everything into RAM and process it
there without temporary files:

* If nfiles > nproc and nfiles < npar we can run one process per file
  and merge the outputs on the fly (see the sketch after this list).
* If nfiles > nproc and nfiles > npar we group the nfiles into npar
  groups. Each group is sorted sequentially, but the groups are run in
  parallel and the output is merged on the fly.
* If nfiles < nproc the content of the files needs to be split into
  nproc (or more) chunks.
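
As a tiny illustration of the first case (one process per file, merge
on the fly), assuming bash and just a handful of files whose names are
placeholders here:

# Each file is sorted by its own process; 'sort -m' merges the
# already-sorted streams on the fly, with no temporary files.
sort -m <(sort file1) <(sort file2) <(sort file3) > sorted.out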

If sfiles > RAM then we need to use temporary files. It should be
possible to write each temporary file only once (so in total, to sort
1G you need to read 2G and write 2G):

* If nfiles > nproc and nfiles < npar we can run one process per file,
  saving each result to a temporary file, and finally merge all the
  tempfiles (see the sketch after this list).
* If nfiles > nproc and nfiles > npar we group the nfiles into npar
  groups. Each group is sorted sequentially, but the groups are run in
  parallel with the output of each group saved to a temporary file.
  After the sorts finish, the tempfiles are merged.
* If nfiles < nproc the content of the files needs to be split into
  nproc (or more) chunks. Each chunk is then sorted, saved to a
  tempfile, and the tempfiles are merged at the end.
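
A minimal sketch of the first case above (one tempfile per input file,
merged at the end); the file names are placeholders:

# Sort each input file into its own temporary file in parallel,
# then merge all the sorted tempfiles once the sorts have finished.
TMP=$(mktemp -d /tmp/parsort.XXXXX)
parallel "sort {} > $TMP/{#}.sorted" ::: file1 file2 file3
sort -m $TMP/*.sorted > sorted.out
rm -rf $TMP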

The script below tries to let 'sort' deal with the problem of temporary
files itself. It should scale up to npar*ncmd files. It reads the list
of files to sort on stdin (standard input), so it does not work for
data arriving on a pipe. I am sure it has other bugs too, but it might
be OK for your purpose.

#!/bin/bash

DIR=$(mktemp -d /tmp/parsort.XXXXX)
cat >$DIR/filenames
# Create a bunch of fifos for sorting into
cat $DIR/filenames | parallel -Xj0 mkfifo $DIR/sort-{#}
# Start the merge sort into the fifos
cat $DIR/filenames | parallel -Xj0 sort {} \>$DIR/sort-{#} &

# Create a bunch of fifos for merging into
parallel -X mkfifo $DIR/merge-{#} ::: $DIR/sort-*
# Multilevel merge
parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} ::: $DIR/sort-* &
# Read and merge from the fifos
sort --batch-size=1000 -m $DIR/merge-*
# Remove the fifos and the rest of the temporary directory
rm -rf $DIR

For sorting the output of a pipe we need to split the pipe with
--pipe. Unfortunately we do not know how big the input is, so we do
not know how many chunks we will get. It would be handy if GNU
Parallel could distribute the input among N processes so that when a
process finishes reading one chunk, the same process gets another.
Unfortunately that is not how GNU Parallel works today.

#!/bin/bash

DIR=$(mktemp -d /tmp/parsort.XXXXX)
parallel --block 100M --pipe --files sort -S110000000 > $DIR/sort-files

# Create a bunch of fifos for merging into
parallel -X mkfifo $DIR/merge-{#} :::: $DIR/sort-files
# Multilevel merge
parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} :::: $DIR/sort-files &
# Read and merge from the fifos
sort --batch-size=1000 -m $DIR/merge-*
# Remove the fifos and the tempfiles
rm -rf $(cat $DIR/sort-files) $DIR


None of the solutions above is perfect for every situation, but in
some situations they should do a reasonable job. In my tests they are
both faster than a plain GNU sort.


/Ole


Hi Ole,

Thanks for the in-depth response - I'll have a play next week!

I'm pretty sure the bottleneck in the pipeline I'm using is upstream of the sort, so parallelising the sort is unlikely to make things faster overall. You might be interested in this: http://www.cs.wustl.edu/~jain/cse567-08/ftp/sorting/index.html#s3

What I'm actually doing is using the ABySS genome assembler. Part of the pipeline is:

KAligner | ParseAligns | sort | DistanceEst

KAligner takes sequences from one file (queries) and finds alignments against sequences in another file (targets), outputting these in Sequence Alignment/Map (SAM) format. ParseAligns takes the SAM output and filters out some alignments. The ParseAligns step is the slowest, and I'm looking at how best to split up that work to make use of more cores. A job for early next week!
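
One option might be to wrap ParseAligns in 'parallel --pipe' so several
copies each work on a chunk of the SAM stream. Just a sketch - the real
KAligner/ParseAligns/DistanceEst arguments are omitted, and it assumes
ParseAligns can handle independent chunks, which may not hold if it
needs to see both reads of a pair:

# Sketch only - arguments omitted; assumes ParseAligns can process
# independent chunks of the SAM stream.
KAligner ... | parallel --pipe --block 50M ParseAligns ... | sort | DistanceEst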

Cheers,
Nathan

