|
From: | Mark E. Shoulson |
Subject: | sort --head |
Date: | Tue, 18 Jun 2024 21:49:35 -0400 |
User-agent: | Mozilla Thunderbird |
I can't really believe that for all these years, we still so very, very often run sort(1) and pipe it through head(1) or tail(1), because we only want the top/bottom 20 lines or whatever. That means we're sorting a potentially gigantic list of lines but not caring about most of the information.
I'm attaching a patch that adds `--head` and `--tail` options to sort(1), so you say `sort --head=20` (which is equivalent to saying `sort --tail=-20` and vice-versa). I've worked this into sort(1) VERY VERY BADLY, I'm sorry to say, but maybe someone with better familiarity with the code can do it better. Basically, I'm special-casing out the situation when --head is active and doing my own thing with it, and then lying to the rest of the program about it. But it works, anyway.
Performance: with a file consisting of 8,000,000 random strings, `sort | head -20` takes about 3.5sec (three sample consecutive runs yield 3.443, 3.669, 3.372 sec; presumably the I/O cached is as cached as it can be). If we stipulate `--parallel=1` (no parallel sorting), then we get about 24.5sec (24.230, 24.516, 24.157). But `sort --head=20` takes just 1.36 seconds, even with parallel=1 (because I didn't bother with any parallel stuff anyway.) I think that's significant enough to be worth a look.
The algorithm is straightforward: have a buffer of k lines; as each line comes in, if it's bigger/smaller than the kth line (or the buffer isn't full yet) insertion-sort it into the buffer and remove the kth line. Otherwise, don't even bother processing it further. It's an O(nk) operation at worst, and k << n, basically a constant, so it's O(n). (My buffer actually COPIES the lines (sorry) because just shifting around pointers didn't work; pointers wound up pointing to the wrong places at the end. But this way the memory usage is limited, so it works out.)
I didn't even document it yet, please let me know what you think. Does someone want to do this better? Or just want me to write the docs and resubmit?
Thanks, ~mark
sorthead.patch
Description: Text Data
[Prev in Thread] | Current Thread | [Next in Thread] |