[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] sort: Add --threads option, which parallelizes internal sort
Re: [PATCH] sort: Add --threads option, which parallelizes internal sort.
Tue, 31 Mar 2009 00:51:57 -0700
Sorry, I forgot to cc the mailing list...
On Sat, Mar 28, 2009 at 04:51:52PM +0100, Ralf Wildenhues wrote:
> Hello Glen,
> * Glen Lenker wrote on Fri, Mar 27, 2009 at 11:07:19PM CET:
> > On Thu, Mar 26, 2009 at 09:50:08PM +0000, Ralf Wildenhues wrote:
> > > Example run, on an 8-way, and with cat'ed instances of the dictionary,
> > > on tmpfs, timings best of three:
> > > It suggests to me that too much time is spent busy-waiting in
> > > pthread_join,
> > > or that sort is computing too much (I haven't looked at the patch in
> > > detail).
> > >
> > > Also, I'd have expected the rate going from 1 to 2 threads to get at least
> > > a bit better with bigger file size, but it remains remarkably constant,
> > > around 1.45 for this setup. What am I missing?
> > What is the specifications for your computer? Does 8-way == 8 cores?
> Yes. Or 2 CPUs with 4 cores each or 4 CPUs with 2 cores each, I don't
> remember. Is that distinction important?
Sorry about that. I could've guessed, but I wasn't sure that 8-way
wasn't something I haven't heard of yet. When I think about it, the
difference between a 2 CPU 4 core machine and a 4 CPU 2 core machine
may actually matter when it comes to caches, IO bandwidths, etc.
> > You are probably not missing anything, the parallelization we used is
> > rather simplistic. We were given only ten weeks and getting gnulib to
> > work for us took a lot more time than we expected.
> Sure. Thanks for doing the work!
> > The TODO actually
> > mentions a possible optimization to would probably reduce the amount
> > of time wasted blocking on pthread_join. Paul Eggert thought up of
> > this idea a few weeks ago and I haven't been able to try my hand at it
> > yet.
> I haven't quite understood the what TODO entry suggests, and I also
> haven't taken the time to look at Knuth's description of the
> parallelization, but if it's a straight-forward divide and conquer
> of an O(N log N) algorithm where, in the limit, only O(N) amount of
> work is not parallelizable (the merge part), then the ratio from 1
> to 2 threads should improve with larger input size N, even without
> implementing the TODO entry, no?
I am under the impression that pthread_join doesn't busy wait, can
someone confirm this? Also, looking at the timings it seems that
doubling the number of threads correlates with the CPUs being
26.95user 1.67system 0:28.62elapsed 100%CPU ~ 100% / cpu
30.78user 1.98system 0:19.49elapsed 168%CPU ~ 84% / cpu
37.41user 2.04system 0:15.17elapsed 260%CPU ~ 65% / cpu
60.68user 2.79system 0:15.18elapsed 417%CPU ~ 52% / cpu
This decrease of CPU utilization sounds a little too large to be
caused solely by our calls to pthread_join, so there is probably
something else at work here..
> I haven't thought about whether cache or bandwidth-limit effects
> can cause this.
Is the cache or the memory bandwidth on this machine special on this
Thanks again Ralf for running this patch on your 8-way machine again. This
problem wasn't as obvious on a dual-core machine as it is in your
Conversely, this problem could actually explain why on a dual-core
machine increasing the number of threads to 4 generally increased
performance. This was something that puzzled me at the time, and was
If you could, can you trying increasing the number of threads on these
tests until no performance increase can be detected?
Perhaps the default number of threads used by the multicore sort
should take this into account instead of just using the number of