From: Jason Bucata
Subject: bug#13354: sort: File Merge Order is Suboptimal with Many Passes
Date: Thu, 3 Jan 2013 22:07:00 -0600
User-agent: Mutt/1.5.21 (2010-09-15)

Over a year ago, I was working with sort on some large input files.  I
noticed a performance problem that related to the number of temp files being
used for sort runs.  As I recall, I'd also seen the problem prior to that,
but this was the first time I took an opportunity to diagnose the problem
in detail.

(As for why I waited so long to finally report it, more below.)

From what I recall, it tended to want to do 8-way merges on my system.  If
the input file was just big enough, it would wind up with 9 temp files left
over toward the end.  It would merge the largest 8 files in one more pass,
and then do one final merge to combine the now-huge temp file with the one
likely very tiny temp file still remaining.

I found that the extra pass added a lot of extra time to the sort.  If I
bumped up the memory buffer to increase the initial run sizes enough to get
rid of one temp file, I could shave at least a few minutes off the total run
time: the little extra memory made a disproportionate difference, and I
could attribute it directly to avoiding that extra merge pass at the end.

At the time I reasoned that it would make the most sense to merge a small
batch of temp files at the very beginning, reducing the count just enough
that every pass after that is a full --batch-size merge.  In this case, if
the extra temp file at the end was 10 MB, then each merge pass would have to
process 10 MB more.  Over 4 merge passes, say, that's 40 MB extra to read.
But the benefit is saving the final merge pass over those 10 MB plus
something a WHOLE lot bigger than 40 MB, if it's big enough to require 4
merge passes to combine it all in the first place!

I held off on reporting the bug because I wanted to see what Knuth had to
say on the subject.  Well, tonight I finally went to the library and looked
through volume 3 to see his advice. :)

In section 5.4.9 he basically echoes my thinking.

In the simple case where all runs are the same length, he recommends a
queue to track the files to process, with N files pulled from the front of
the queue and the merged output added to the back.  But he says to add a
number of "dummy" empty files to the front of the queue.  The net effect is
that fewer than N files are merged the first time.
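
Here's a toy simulation I put together to convince myself the schedule
works out.  Nothing in it is from sort.c; N and the run count are made up
to match my 9-files-with-8-way-merges situation:

/* Toy simulation of the equal-length scheme: pad the queue with dummy
   (empty) runs so that every merge after the first is a full N-way
   merge.  N and the run count are invented for the example.  */
#include <stdio.h>

int
main (void)
{
  enum { N = 8 };               /* --batch-size */
  int runs = 9;                 /* initial runs, all the same length */
  int dummies = (N - runs % (N - 1)) % (N - 1);
  int queue = dummies + runs;   /* dummies conceptually sit at the front */
  int pass = 0;

  while (queue > 1)
    {
      int take = queue < N ? queue : N;
      /* Only the first pass sees the dummies (there are at most N - 2
         of them), so its real width is take - dummies; every later
         pass is a full N-way merge.  */
      pass++;
      printf ("pass %d: %d-way merge\n", pass,
              pass == 1 ? take - dummies : take);
      queue -= take - 1;        /* take runs out, put the merged run back */
    }
  return 0;
}

With 9 runs and N = 8 the 6 dummies make the first pass a 2-way merge and
the final pass a full 8-way merge, i.e. the reverse of the 8-then-2 order
I actually saw.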

Looking at the merge function in sort.c, it appears it can merge temp
files and input files at the same time.  With that, we probably want an
enhancement of that algorithm, which Knuth calls "Huffman's method"
(referencing something in chapter 2).  For files or sort runs of varying
sizes, Knuth says to use a priority queue and merge the smallest N files
on each pass.  He also says to add a number of 0-length dummy files at the
front of the priority queue so that the first merge pass grabs fewer than
N files.
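
A rough sketch of the schedule that produces, again just a standalone toy
I wrote; the sizes, the batch size of 4, and all the names are mine and
have nothing to do with sort.c:

/* Toy simulation of the varying-size schedule: pad with zero-length
   dummies, then always merge the N smallest remaining runs.  The run
   sizes (in MB) and N are invented for the example.  */
#include <stdio.h>
#include <stdlib.h>

static int
size_cmp (const void *a, const void *b)
{
  const long *x = a, *y = b;
  return (*x > *y) - (*x < *y);
}

int
main (void)
{
  enum { N = 4 };                                       /* --batch-size */
  long size[32] = { 10, 300, 280, 310, 290, 305, 295, 320 };
  int count = 8;
  long total = 0;

  /* Pad so that count is 1 mod (N - 1); dummies are zero-length runs.  */
  int dummies = (N - count % (N - 1)) % (N - 1);
  while (dummies--)
    size[count++] = 0;

  while (count > 1)
    {
      long merged = 0;
      qsort (size, count, sizeof size[0], size_cmp);
      for (int i = 0; i < N; i++)       /* merge the N smallest runs */
        merged += size[i];
      for (int i = N; i < count; i++)   /* close the gap they leave */
        size[i - N + 1] = size[i];
      size[0] = merged;                 /* the merged run goes back in */
      count -= N - 1;
      total += merged;
      printf ("merge %ld MB, %d runs left\n", merged, count);
    }
  printf ("total merge I/O: %ld MB\n", total);
  return 0;
}

With those numbers the cheap first merge soaks up the two dummies, the
tiny 10 MB run, and one more run, and every later pass is a full 4-way
merge of real runs.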

If I understand his formula correctly, he suggests adding (1-F) % (N-1)
dummy runs (taking the modulus so the result comes out nonnegative), where
N is --batch-size and F is the starting number of files.
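
In C that would be something like the following; it's just my translation
of the formula, nfiles and nmerge are placeholder names, and the outer
modulus is only there to keep the result in [0, N-2]:

#include <stddef.h>

/* Number of zero-length dummy runs to add so that, merging nmerge files
   at a time, every pass after the first can be a full nmerge-way merge.
   This is (1 - nfiles) mod (nmerge - 1), written to stay nonnegative.  */
size_t
dummy_runs (size_t nfiles, size_t nmerge)
{
  return (nmerge - nfiles % (nmerge - 1)) % (nmerge - 1);
}

E.g. 9 files with a batch size of 8 gives 6 dummies, which is exactly
what turns the 8-then-2 schedule into 2-then-8.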

To get it ideal, we'd need a priority queue implementation here, maybe a
heap or something.  (The irony of having to maintain a sorted list of files
in a sort program is duly noted...)  It would still be an improvement,
though, just to merge a smaller batch of files at the beginning, the
equivalent of adding 0-byte dummy files at the front.  That might be
quicker to implement, unless there's a GPL'd heap library lying around
somewhere...
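
For the priority queue, even a tiny hand-rolled binary min-heap keyed on
run size would probably do.  Something along these lines, as a rough
sketch only; the struct and function names are invented and nothing here
exists in sort.c:

#include <stdlib.h>
#include <sys/types.h>

struct run { off_t size; char *name; };         /* a temp or input file */
struct run_heap { struct run *a; size_t len; };

static void
heap_push (struct run_heap *h, struct run r)
{
  struct run *a = realloc (h->a, (h->len + 1) * sizeof *a);
  if (!a)
    abort ();                   /* real code would report the error */
  h->a = a;

  /* Sift the new run up until its parent is no larger.  */
  size_t i = h->len++;
  while (i && r.size < a[(i - 1) / 2].size)
    {
      a[i] = a[(i - 1) / 2];
      i = (i - 1) / 2;
    }
  a[i] = r;
}

static struct run
heap_pop (struct run_heap *h)   /* caller guarantees h->len > 0 */
{
  struct run *a = h->a;
  struct run top = a[0];
  struct run last = a[--h->len];

  /* Sift the former last element down from the root.  */
  size_t i = 0;
  for (;;)
    {
      size_t child = 2 * i + 1;
      if (child >= h->len)
        break;
      if (child + 1 < h->len && a[child + 1].size < a[child].size)
        child++;
      if (last.size <= a[child].size)
        break;
      a[i] = a[child];
      i = child;
    }
  if (h->len)
    a[i] = last;
  return top;
}

Each merge pass would then pop the --batch-size smallest runs, merge them
into one temp file, and push that file back with the combined size; the
dummies are just runs pushed with size 0.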

Jason B.

-- 
Half the harm that is done in this world is due to people who want to feel
important.
        -- T. S. Eliot




