[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
## sort suggestion : --head/tail n option

**From**: |
Philippe De Muyter |

**Subject**: |
sort suggestion : --head/tail n option |

**Date**: |
Thu, 22 Apr 2004 18:48:49 +0200 (CEST) |

Hello,
I am considering adding a new option to the GNU `sort' utility :
the goal is to produce only the first or the last n lines of the sorted
output. Thus the result should be exactly equal to
sort <sort-options> | tail -n
or
sort <sort-options> | head -n
I am proposing to name this mutually exclusive options
--head n prints only the first n lines of the sorted output
--tail n prints only the last n lines of the sorted output
The goal is to avoid `sort' the cost of sorting the rest of the inputs.
It is indeed useless and costly to sort 100000 lines if one just needs the
10 first/last.
My implementation idea is to use a heap of n entries, filling it with the
first n inputs, sorted such that the top of the heap is the nth or the nth
to last in the current virtual output, and then comparing the following
inputs with the top of the heap, either discarding the new input,
or removing the top of the heap and inserting the new input in the heap.
At the end of all inputs we are left with exactly n inputs half-sorted in
a heap, that we must thus finish to sort.
I have done some speed tests with an adhoc strcmp-only sort program and
on a Pentium(R) 4 2.40GHz, I get e.g. :
$ ls -U | wc -l
109694
$ time ls -U | /tmp/lastn -1000 > /dev/null
real 0m0.437s
user 0m0.136s
sys 0m0.098s
$ time ls -U | sort | tail -1000 > /dev/null
real 0m5.252s
user 0m1.759s
sys 0m0.107s
Do you think that this new option would fit easily in the current `sort'
implementation, and how ?
Best Regards
Philippe
Philippe De Muyter phdm at macqel dot be Tel +32 27029044
Macq Electronique SA rue de l'Aeronef 2 B-1140 Bruxelles Fax +32 27029077

**sort suggestion : --head/tail n option**,
*Philippe De Muyter* **<=**