
Re: Parallelizing sort


From: Pádraig Brady
Subject: Re: Parallelizing sort
Date: Thu, 12 Feb 2009 10:15:59 +0000
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Glen Lenker wrote:
> On Wed, Feb 11, 2009 at 10:12:42PM -0800, Nima Nikzad wrote:
>> Hello coreutils community,
>> I am new to the exciting world of open source development and thought I
>> should introduce myself and get some feedback on what I am working on.  My
>> name is Nima Nikzad and I, along with a small group of students, am working
>> on changes to coreutils as part of a project for Prof. Paul Eggert's
>> software engineering course at UCLA.  Specifically, we are looking at how we
>> can better utilize modern multi-core processors when sorting and merging
>> files.  My group is first taking a look at how to parallelize the external
>> sort phase of sort's operation.  Another group of students is taking a look
>> at in-memory sort while we are working on this.
>>
>> Has anyone already tried their hand at parallelizing this code?  Are there
>> any special considerations we should keep in mind while moving forward?  My
>> group was looking to tackle the problem by using OpenMP as we have some
>> experience working with it in the past and like that we can do quite a bit
>> with it while (hopefully) having little impact on the structure of
>> the existing code.  Does anyone have any feedback on threading technologies
>> that would be appropriate for this project or does anyone think OpenMP is a
>> poor choice?
>>
>> I look forward to hearing your suggestions!  We are looking to have
>> something implemented and benchmarked soon and I will be sure to keep in
>> contact with the community.  Please feel free to reach me at nnikzad at
>> ucla.edu.  Thank you!
>>
>> - Nima Nikzad
> 
> Hi everyone,
> 
> My name is Glen Lenker. I am the project leader for the second group
> working to parallelize sort under Paul Eggert.
> 
> As mentioned above, my group is primarily focused on parallelizing the
> "in-memory" sort, though we are still considering parallelizing other
> aspects of the current sort as well. At the moment we are considering
> using the Gnulib thread module as our threading library.
> 
> Jim: We heard from Paul Eggert today that you have looked into how sort
> would benefit from parallelization. I am particularly interested in your
> approach. If you don't mind, perhaps you could start the discussion on
> this.
> 

Well, I was thinking along the lines of something more "UNIXy".

I'm worried that a threading solution would complicate things.
The trade-off with a multi-process solution is increased data
copying between processes, but it also gives you the option of
running those processes on separate hosts if desired.
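
For example, a minimal sketch of a distributed variant (hostA,
hostB and the /data paths are placeholders, and it assumes both
hosts sort with the same locale as the merging host):

# sort two pre-split files remotely, merge the results locally
sort -m <(ssh hostA 'sort /data/part1') \
        <(ssh hostB 'sort /data/part2')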

Also, the "UNIXy" solution should be quite easy to implement,
and it would act as a baseline for comparison against any
threading solution you might think is better.

Parameters to vary when testing solutions would be:
  * files on SSD/ramdisk vs non-cached files on hard disk
  * data < ram_size
  * data > ram_size
  * number of CPUs > data_size/ram_size
  * number of CPUs < data_size/ram_size
  * expensive sorting (-g for example) vs simple (ASCII data);
    see the timing sketch after this list
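
For that last case, a rough timing recipe (just a sketch: the
test file name and data size are arbitrary, and the
cache-dropping step is Linux specific and needs root):

# generate ~10M random numeric lines of test data
awk 'BEGIN { srand(); for (i = 0; i < 1e7; i++) print rand() }' > nums.txt

sync && echo 3 > /proc/sys/vm/drop_caches  # start with cold caches
time sort -g nums.txt > /dev/null          # expensive numeric comparison

sync && echo 3 > /proc/sys/vm/drop_caches
time LC_ALL=C sort nums.txt > /dev/null    # simple byte-wise comparison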

Here's a very quick idea of splitting the data to sort
between different sort processes....

sort -m <(read_chunk 1 2 "$file" | sort) \
        <(read_chunk 2 2 "$file" | sort)

A wrapper script could be created to determine automatically
(or accept as a parameter) the right number of sort processes
to run, bearing in mind that for traditional hard disks,
starting more than one per spindle will only slow things down
as they fight over the disk head. Note the wrapper script
should probably accept more than one file as a parameter, so
that pre-split files (possibly on different media) can be
processed.
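
Something like this perhaps (just a sketch: psort is a
hypothetical name, it handles a single file, and it relies on
the read_chunk placeholder from the example above):

# psort FILE [N]: split FILE into N chunks (default: one per
# CPU), sort each in its own process, and merge the results
psort() {
  local file=$1 n=${2:-$(getconf _NPROCESSORS_ONLN)} cmd="sort -m" i
  for ((i = 1; i <= n; i++)); do
    cmd+=" <(read_chunk $i $n \"\$file\" | sort)"
  done
  eval "$cmd"  # sort -m <(read_chunk 1 N "$file" | sort) ...
}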

The `read_chunk` process above is currently awkward and
inefficient to implement with dd and split. As a first step,
I think it would be very useful to add a --number option to
`split`, like:
   --number=5    # split input into 5 chunks
   --number=2/5  # output chunk 2 of 5 to stdout
In text mode, this should split only on line boundaries,
adding any extra data to the last chunk.
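
Until then, a line-count based stand-in is possible, though it
wastes a full extra pass over the data just to count lines
(a sketch only; read_chunk is the hypothetical helper used in
the examples above):

# read_chunk I N FILE: print the I-th of N roughly equal
# line-based chunks of FILE to stdout
read_chunk() {
  local i=$1 n=$2 file=$3 lines per
  lines=$(wc -l < "$file")
  per=$(( (lines + n - 1) / n ))  # ceiling division
  tail -n +"$(( (i - 1) * per + 1 ))" "$file" | head -n "$per"
}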

cheers,
Pádraig.



