[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

sort suggestion : --head/tail n option

From: Philippe De Muyter
Subject: sort suggestion : --head/tail n option
Date: Thu, 22 Apr 2004 18:48:49 +0200 (CEST)


I am considering adding a new option to the GNU `sort' utility :

the goal is to produce only the first or the last n lines of the sorted
output.  Thus the result should be exactly equal to

        sort <sort-options> | tail -n
        sort <sort-options> | head -n

I am proposing to name this mutually exclusive options

        --head n        prints only the first n lines of the sorted output
        --tail n        prints only the last n lines of the sorted output

The goal is to avoid `sort' the cost of sorting the rest of the inputs.
It is indeed useless and costly to sort 100000 lines if one just needs the
10 first/last.

My implementation idea is to use a heap of n entries, filling it with the
first n inputs, sorted such that the top of the heap is the nth or the nth
to last in the current virtual output, and then comparing the following
inputs with the top of the heap, either discarding the new input,
or removing the top of the heap and inserting the new input in the heap.

At the end of all inputs we are left with exactly n inputs half-sorted in
a heap, that we must thus finish to sort.

I have done some speed tests with an adhoc strcmp-only sort program and
on a Pentium(R) 4 2.40GHz, I get e.g. :

        $ ls -U | wc -l
        $ time ls -U | /tmp/lastn -1000 > /dev/null
        real    0m0.437s
        user    0m0.136s
        sys     0m0.098s
        $ time ls -U | sort | tail -1000 > /dev/null
        real    0m5.252s
        user    0m1.759s
        sys     0m0.107s

Do you think that this new option would fit easily in the current `sort'
implementation, and how ?

Best Regards


Philippe De Muyter  phdm at macqel dot be  Tel +32 27029044
Macq Electronique SA  rue de l'Aeronef 2  B-1140 Bruxelles  Fax +32 27029077

reply via email to

[Prev in Thread] Current Thread [Next in Thread]