Re: [PATCH] sort: Add --threads option, which parallelizes internal sort


From: Paul Eggert
Subject: Re: [PATCH] sort: Add --threads option, which parallelizes internal sort.
Date: Fri, 03 Apr 2009 12:57:54 -0700
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.3 (gnu/linux)

Jim Meyering <address@hidden> writes:

> Ramping up to 5M lines, the resulting test takes almost 2 minutes and
> the sort itself took 34s on this particular quad-core system.  ...  A
> more interesting test would be to ensure that when run on a multi-core
> system sorting with --threads=2 is at least X% faster than sorting
> with --threads=1.

The patch I submitted used a small test because I'm developing on an
old, slow machine--a 2.4 GHz Pentium 4 with 0.5 MiB cache and 1 GiB RAM.
(I really need to upgrade it, but we have this little budget problem in
California state institutions right now....)

More important, it's not clear to me what the role of the test suite
ought to be.  Should the test really fail if it doesn't get enough
performance improvement with 2 threads?  How do we decide what's
"enough"?  None of our other tests are performance tests, so we are on
somewhat new ground here.

Given the costs involved I'm inclined to think that "make check" should
focus on correctness tests, and a new target ("make benchmark", say?)
should be used for performance tests.  But perhaps my view is colored by
my using such a slow machine.

> I'm a little reluctant to apply this patch in its current
> state, since it's not achieving reasonable efficiency.

We'll look into speeding it up.  I've asked Glen to generate profiling
information.  Obviously there's something screwy going on.  If it is a
CPU bottleneck in pthread_join, though, that suggests something is
wrong with pthread_join itself: pthread_join should not busy-wait.
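
For what it's worth, here is a minimal, self-contained sketch of the
fork/join structure under discussion (not the actual patch; the names
struct slice, sort_slice, and so on are made up for illustration): two
threads each qsort half of an array of lines, and the main thread
joins them and does a sequential merge.  The joins are exactly where a
busy-wait, if there were one, would show up as CPU time; on a working
pthreads implementation they should be almost pure idle time.

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical work unit: sort a slice of an array of lines.  */
  struct slice { char **lines; size_t nlines; };

  static int
  cmplines (void const *a, void const *b)
  {
    return strcmp (*(char *const *) a, *(char *const *) b);
  }

  static void *
  sort_slice (void *arg)
  {
    struct slice *s = arg;
    qsort (s->lines, s->nlines, sizeof *s->lines, cmplines);
    return NULL;
  }

  int
  main (void)
  {
    char *lines[] = { "pear", "apple", "fig", "cherry", "banana", "date" };
    size_t n = sizeof lines / sizeof *lines;
    size_t half = n / 2;

    /* Fork: each thread sorts one half.  (Error checking omitted.)  */
    struct slice lo = { lines, half };
    struct slice hi = { lines + half, n - half };
    pthread_t t1, t2;
    pthread_create (&t1, NULL, sort_slice, &lo);
    pthread_create (&t2, NULL, sort_slice, &hi);

    /* Join: these should block until the threads finish, not spin.  */
    pthread_join (t1, NULL);
    pthread_join (t2, NULL);

    /* Sequential two-way merge of the sorted halves.  */
    char *out[sizeof lines / sizeof *lines];
    size_t i = 0, j = half, k = 0;
    while (i < half && j < n)
      out[k++] = cmplines (&lines[i], &lines[j]) <= 0 ? lines[i++] : lines[j++];
    while (i < half)
      out[k++] = lines[i++];
    while (j < n)
      out[k++] = lines[j++];

    for (k = 0; k < n; k++)
      puts (out[k]);
    return 0;
  }

Compile with "gcc -pthread".  In a profile of something shaped like
this, pthread_join should contribute essentially no user or system CPU
time.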

Of course we cannot reasonably expect this one performance improvement
to make 'sort' run 16x faster on a 16-CPU machine.  That is because the
improvement parallelizes only one part of 'sort'.  Even assuming the
parallelized part gets done in essentially zero time, the rest of 'sort'
is still sequential (notably, input and output), and so it will dominate
the wall-clock time.  Obviously we would like to parallelize all of
'sort', but the idea is to do one step at a time, focusing on the
easiest parts first, and showing real improvements at each step.  I
would be quite happy if this particular improvement gave us a 2x
wall-clock improvement.
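
To put a rough bound on that: if a fraction p of the single-threaded
wall-clock time is spent in the part we parallelize, then with N
threads the best we can hope for is roughly

  speedup(N) = 1 / ((1 - p) + p/N)  <=  1 / (1 - p)

(Amdahl's law).  If, say, half of the time went to the internal sort
(p = 0.5, an illustrative figure rather than a measurement), then 4
threads give at most 1 / (0.5 + 0.5/4) = 1.6x, and even infinitely
many CPUs would cap us at 2x.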

On Ralf's test (256 MB file, 8 CPUs), we saw:

N  Speedup  Overhead
2   1.47       14%
4   1.89       38%
8   1.89      122%

  "N" is the number of threads.

  "Speedup" is the wall-clock speedup relative to the single-threaded run.

  "Overhead" is the extra user+system CPU time devoted to the sort,
    relative to the single-threaded run.

Obviously on Ralf's machine there's no reason to use 8 threads.
However, it would certainly be reasonable for people to want 2 or 4
threads, especially if the CPUs are otherwise idle, since they get
their answers back faster.  On a busy system, whether one would want
1, 2, or 4 threads is a matter of preference; personally I'd go with
4, though I can see that other people might want 2 or perhaps even 1.

With the above in mind, a simple workaround would be to cap the
default number of threads at 4 for now, since (as a practical matter)
we don't get any wall-clock benefit above 4 threads.  However, I'd
like us to pursue the overhead issue first, to see if we can improve
it.
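
Concretely, the cap could be as small a change as something like this
when choosing the default (a sketch only; default_nthreads,
MAX_DEFAULT_THREADS, and the nproc parameter are placeholder names,
not necessarily what the patch uses):

  #include <stddef.h>

  /* Hypothetical helper: pick the default number of sort threads,
     capped at 4 until the overhead above 4 threads is understood.
     NPROC is the number of processors available to the process.  */
  static size_t
  default_nthreads (size_t nproc)
  {
    enum { MAX_DEFAULT_THREADS = 4 };
    return nproc < MAX_DEFAULT_THREADS ? nproc : MAX_DEFAULT_THREADS;
  }

An explicit --threads=N on the command line would of course still
override the capped default.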

> Also, it'd be nice to add the following:
>   - a test that uses --threads=N to exercise the new option-parsing code,
>       though if you adjust as suggested to compare --threads=N for
>       N=1&2, that's not needed.
>   - a NEWS item

OK, we'll look into that too.  Thanks.

-- Paul



