coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

sort --head


From: Mark E. Shoulson
Subject: sort --head
Date: Tue, 18 Jun 2024 21:49:35 -0400
User-agent: Mozilla Thunderbird

Hi.  New here, to post a patch which I hope will inspire someone to do a better job.

I can't really believe that for all these years, we still so very, very often run sort(1) and pipe it through head(1) or tail(1), because we only want the top/bottom 20 lines or whatever.  That means we're sorting a potentially gigantic list of lines but not caring about most of the information.

I'm attaching a patch that adds `--head` and `--tail` options to sort(1), so you say `sort --head=20` (which is equivalent to saying `sort --tail=-20` and vice-versa).  I've worked this into sort(1) VERY VERY BADLY, I'm sorry to say, but maybe someone with better familiarity with the code can do it better.  Basically, I'm special-casing out the situation when --head is active and doing my own thing with it, and then lying to the rest of the program about it.  But it works, anyway.

Performance: with a file consisting of 8,000,000 random strings, `sort | head -20` takes about 3.5sec (three sample consecutive runs yield 3.443, 3.669, 3.372 sec; presumably the I/O cached is as cached as it can be).  If we stipulate `--parallel=1` (no parallel sorting), then we get about 24.5sec (24.230, 24.516, 24.157).  But `sort --head=20` takes just 1.36 seconds, even with parallel=1 (because I didn't bother with any parallel stuff anyway.)  I think that's significant enough to be worth a look.

The algorithm is straightforward: have a buffer of k lines; as each line comes in, if it's bigger/smaller than the kth line (or the buffer isn't full yet) insertion-sort it into the buffer and remove the kth line.  Otherwise, don't even bother processing it further.  It's an O(nk) operation at worst, and k << n, basically a constant, so it's O(n).  (My buffer actually COPIES the lines (sorry) because just shifting around pointers didn't work; pointers wound up pointing to the wrong places at the end. But this way the memory usage is limited, so it works out.)

I didn't even document it yet, please let me know what you think.  Does someone want to do this better?  Or just want me to write the docs and resubmit?

Thanks,

~mark

Attachment: sorthead.patch
Description: Text Data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]