
[PATCH] sort: add --threads option to parallelize internal sort.

From: Chen Guo
Subject: [PATCH] sort: add --threads option to parallelize internal sort.
Date: Mon, 8 Mar 2010 01:11:36 -0800 (PST)

Hi all,
    This is a patch to speed up sort by parallelizing its internal (in-memory) 
sort. A basic description of the algorithm follows, taken directly from the 
comments in sort.c.

/* There are three phases to the algorithm: node creation, sequential sort,
   and binary merge.

   During node creation, sortlines recursively visits each node in the
   binary merge tree and creates a NODE structure corresponding to all the
   future line merging NODE is responsible for. For each call to
   sortlines, half the available threads are assigned to each recursive
   call, until a leaf node having only 1 available thread is reached.

   Each leaf node then performs two sequential sorts, one on each half of
   the lines it is responsible for. It records in its NODE structure that
   there are two sorted sublists available to merge from, and inserts its
   NODE into the priority queue.

   The binary merge phase then begins. Each thread drops into a loop
   in which it retrieves a NODE from the priority queue, merges lines
   available to that NODE, and potentially inserts NODE and/or its parent
   back into the queue if there are sufficient available lines for them
   to merge. This continues until all lines at all nodes of the merge
   tree have been merged. */

The following runs were conducted on an otherwise idle Intel Xeon X5460 system, 
2 dies of 4 cores each. All values are averages of 20 runs.

T=1: 5.10s
T=2: 2.87s
T=3: 2.71s
T=4: 1.75s
T=5: 1.66s
T=6: 1.65s
T=7: 1.67s
T=8: 1.31s

Compare with sort's current implementation:
T=1: 5.44s

And the patch submitted by Glen Lenker last year:
T=2: 3.31s
T=3: 3.31s
T=4: 2.32s
T=5: 2.34s
T=6: 2.34s
T=7: 2.33s
T=8: 1.91s

So on 8 threads the improvement over the current implementation is 315% 
(5.44s vs. 1.31s), i.e. more than 4X as fast. Even against this patch's own 
single-thread time of 5.10s, the improvement is 289%, still almost 4X as fast.

I've attached the patch to coreutils as pq.patch, and a patch to Gnulib that 
this patch depends on as memcoll.patch. CC'd are my group mates, Glen Lenker, 
and Professor Eggert.

Attachment: pq.patch
Description: Binary data

Attachment: memcoll.patch
Description: Binary data
