[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[PATCH] sort: add --threads option to parallelize internal sort.
From: |
Chen Guo |
Subject: |
[PATCH] sort: add --threads option to parallelize internal sort. |
Date: |
Mon, 8 Mar 2010 01:11:36 -0800 (PST) |
Hi all,
This is a patch to speed up sort by parallelizing the internal portion of
it. A basic description of the algorithm is as follows, taken directly from the
comments in sort.c.
/* There are three phases to the algorithm: node creation, sequential sort,
and binary merge.
During node creation, sortlines recursively visits each node in the
binary merge tree and creates a NODE structure corresponding to all the
future line merging NODE is responsible for. For each call to
sortlines, half the available threads are assigned to each recursive
call, until a leaf node having only 1 available thread is reached.
Each leaf node then performs two sequential sorts, one on each half of
the lines it is responsible for. It records in its NODE structure that
there are two sorted sublists available to merge from, and inserts its
NODE into the priority queue.
The binary merge phase then begins. Each thread drops into a loop
where the thread retrieves a NODE from the priority queue, merges lines
available to that NODE, and potentially insert NODE and/or its parent back
into the queue if there are sufficient available lines for them to
merge. This continues until all lines at all nodes of the merge tree
have been merged. */
The following runs are conducted on an idle Intel Xeon X5460, 2 dies of 4 cores
each. All values are averages of 20 runs.
T=1: 5.10s
T=2: 2.87s
T=3: 2.71s
T=4: 1.75s
T=5: 1.66s
T=6: 1.65s
T=7: 1.67s
T=8: 1.31s
Compare with sort's current implementation:
T=1: 5.44s
And the patch submitted by Glen Lenker last year:
T=1:5.56s
T=2: 3.31s
T=3: 3.31s
T=4: 2.32s
T=5: 2.34s
T=6: 2.34s
T=7: 2.33s
T=8: 1.91s
So on 8 threads, the improvement is 315% over the current algorithm, or more
than 4X as fast. Compared with the current algorithm's 1 thread time, the
improvement is 289%, still almost 4X as fast.
I've attached the patch to coreutils as pq.patch, and a patch to Gnulib that
this patch depends on as memcoll.patch. CC'd are my group mates, Glen Lenker,
and Professor Eggert.
pq.patch
Description: Binary data
memcoll.patch
Description: Binary data
- [PATCH] sort: add --threads option to parallelize internal sort.,
Chen Guo <=