
RE: Parallel Merge


From: Nathan Watson-Haigh
Subject: RE: Parallel Merge
Date: Sat, 20 Aug 2011 07:55:49 +0930

It does, but my limited testing showed no significant speed improvement when I used --parallel=8 over --parallel=1; top never showed sort going much beyond 100% CPU usage. Maybe disk I/O was limiting me, but I have 4x 15k rpm disks in RAID 5 (over GlusterFS), so I wouldn't have thought so.
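
A minimal sketch of that kind of comparison (the file name is just a placeholder):

# compare wall-clock time with 1 vs. 8 sort threads
time sort --parallel=1 -o /dev/null big.txt
time sort --parallel=8 -o /dev/null big.txt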

I'll have a look at this again next week, but it's slipping down the priority list as it's not the biggest bottleneck in the pipeline.

Cheers,
Nathan





-----Original Message-----


From: Cook, Malcolm [mailto:address@hidden]
Sent: Fri 19/08/2011 23:26
To: 'Ole Tange'; Nathan Watson-Haigh
Cc: 'address@hidden'
Subject: RE: Parallel Merge

Yes, but the latest GNU coreutils provides a parallelized version of sort.

Reading http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html:

'--parallel=n'
Set the number of sorts run in parallel to n. By default, n is set to the number of available processors, but limited to 8, as there are diminishing performance gains after that. Note also that using n threads increases the memory usage by a factor of log n. Also see nproc invocation.
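
For example (a sketch; the file names and the 2G buffer size are arbitrary):

# 8 sort threads, 2 GiB main-memory buffer
sort --parallel=8 -S 2G -o sorted.txt unsorted.txt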


~Malcolm

> -----Original Message-----
> From: address@hidden [mailto:parallel-
> address@hidden On Behalf Of Ole Tange
> Sent: Friday, August 19, 2011 8:25 AM
> To: Nathan Watson-Haigh
> Cc: address@hidden
> Subject: Re: Parallel Merge
>
> On Wed, Aug 17, 2011 at 1:13 PM, Nathan Watson-Haigh
> <address@hidden> wrote:
> >
> > I just found this tool today and I'm very impressed with it!
>
> Good to see a fellow bioinformatician finds it useful. If you like GNU Parallel:
>
> * Post the intro videos on
> forums/blogs/Google+/Twitter/Facebook/LinkedIn
> * Request or write a review for your favourite blog or magazine
> * Invite me as a speaker for your next conference
>
> If GNU Parallel saves you money:
>
> * (Have your company) donate to FSF https://my.fsf.org/donate/
>
> > I've seen the example for parallelising sort... is there a way to spread the
> merge part among more cores? I'm trying to sort over 200 million lines -
> the merge part takes several minutes to complete.
> >
> > How do I write a script for this so I can use it as a drop-in replacement for a
> sort command used in the pipeline of another software tool?
>
> That is an interesting idea. If your 200M lines are in a single file,
> you can use 'parallel --pipe'. This will distribute the merge across all
> cores; only the final merge will be done by a single core. If bigfile is
> 1G, you will in total read 3G and write 3G (which is clearly not optimal):
>
> cat bigfiles* | parallel --pipe --files sort |
>   parallel --files -X sort -m {} ';' rm {} |
>   parallel -Xj1 sort -m {} ';' rm {} > bigfile.sort
>
> However, --pipe is somewhat slow (less than 100MB/s) and may thus be a
> bottleneck, so for an optimal solution we probably need to consider
> several situations separately.
>
> Number of files: nfiles (1 if reading from stdin (standard input))
> Total size of files: sfiles
> Memory available: RAM
> Number of processes that UNIX allows to be run in parallel: npar
> Number of processors: nproc
> Number of filenames that fit on the command line: ncmd
>
> If sfiles < RAM, we should read everything into RAM and process it
> there without temporary files:
>
> * If nfiles > nproc and nfiles < npar, we can run one process per file
> and do the merge on the fly (see the sketch after this list).
> * If nfiles > nproc and nfiles > npar, we group the nfiles into npar
> groups. Each group is sorted sequentially, but the groups are run in
> parallel. The output is merged on the fly.
> * If nfiles < nproc, the contents of the files need to be split into
> nproc (or more) chunks.
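>
> A sketch of the merge-on-the-fly case using bash process substitution
> (file names are placeholders; the inner sorts run in parallel):
>
> sort -m <(sort a.txt) <(sort b.txt) <(sort c.txt) > merged.txt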
>
> If sfiles > RAM, we need to use temporary files. It should be
> possible to write each temporary file only once (so in total, to sort
> 1G you need to read 2G and write 2G):
>
> * If nfiles > nproc and nfiles < npar, we can run one process per file,
> saving each result to a temporary file, and finally merge all of the
> files (see the sketch after this list).
> * If nfiles > nproc and nfiles > npar, we group the nfiles into npar
> groups. Each group is sorted sequentially, but the groups are run in
> parallel, with the output of each group saved to its own temporary
> file. After the sort, the tempfiles are merged.
> * If nfiles < nproc, the contents of the files need to be split into
> nproc (or more) chunks. Each chunk is then sorted, saved to a tempfile,
> and finally merged.
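>
> A sketch of the tempfile variant for two groups (names are placeholders):
>
> sort group1.txt > /tmp/s1 & sort group2.txt > /tmp/s2 & wait
> sort -m /tmp/s1 /tmp/s2 > merged.txt && rm /tmp/s1 /tmp/s2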
>
> The script below tries to let 'sort' deal with the problem of temporary
> files itself. It should scale up to npar*ncmd files. It reads the list
> of files to sort on stdin (standard input), so it does not work for
> data arriving on a pipe. I am sure it also has other bugs, but it might
> be OK for your purpose.
>
> #!/bin/bash
>
> DIR=$(mktemp -d /tmp/parsort.XXXXX)
> cat >$DIR/filenames
> # Create a bunch of fifos for sorting into
> cat $DIR/filenames | parallel -Xj0 mkfifo $DIR/sort-{#}
> # Start the merge sort into the fifos
> cat $DIR/filenames | parallel -Xj0 sort {} \>$DIR/sort-{#} &
>
> # Create a bunch of fifos for merging into
> parallel -X mkfifo $DIR/merge-{#} ::: $DIR/sort-*
> # Multilevel merge
> parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} ::: $DIR/sort-* &
> # Read and merge from the fifos
> sort --batch-size=1000 -m $DIR/merge-*
> # Remove the fifos
> rm -rf $DIR
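>
> Assuming the script is saved as parsort.sh (the name is arbitrary), it
> would be used like this, with the filenames piped on stdin:
>
> ls chunks/* | bash parsort.sh > bigfile.sort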
>
> For sorting the output of a pipe we need to split the pipe with
> --pipe. Unfortunately we do not know how big the input is, so we do
> not know how many chunks we will get. It would be handy if GNU Parallel
> could distribute the input among N processes so that when one process
> finishes reading a chunk, the same process would get another chunk.
> Unfortunately, that is not how GNU Parallel works today.
>
> #!/bin/bash
>
> DIR=$(mktemp -d /tmp/parsort.XXXXX)
> parallel --block 100M --pipe --files sort -S110000000 > $DIR/sort-files
>
> # Create a bunch of fifos for merging into
> parallel -X mkfifo $DIR/merge-{#} :::: $DIR/sort-files
> # Multilevel merge
> parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} :::: $DIR/sort-files &
> # Read and merge from the fifos
> sort --batch-size=1000 -m $DIR/merge-*
> # Remove the fifos and the tempfiles
> rm -rf $(cat $DIR/sort-files) $DIR
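>
> Assuming this one is saved as parsort-pipe.sh, the data itself goes on
> stdin:
>
> cat bigfile | bash parsort-pipe.sh > bigfile.sort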
>
>
> None of the solutions above is perfect for all situations, but for
> some situations they should do a reasonable job. In my tests both are
> faster than GNU sort.
>
>
> /Ole


