[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Request for feature, sort
From: |
Pádraig Brady |
Subject: |
Re: Request for feature, sort |
Date: |
Sun, 4 Nov 2018 00:45:21 -0700 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 |
On 01/11/18 15:28, David Mathog wrote:
> Greetings,
>
> I would like to request a "pre-sorted" (p?) modifier for keys. I
> frequently sort very large files with one or two of the sort keys
> already in the correct order as a consequence of how those files were
> generated. Sort currently has no way of accepting this information so
> that
>
> sort -k 1,1n -k 2,2 -k 3,3 <infile >outfile
>
> pulls the entire file into a buffer and sorts it. That is slow, takes
> up a lot of memory, and is mostly a waste of CPU cycles. The suggested
> syntax of:
>
> sort -k 1,1np -k 2,2p -k 3,3 <infile >outfile
>
> would only sort those small groups of lines where indicated presorted
> keys are all the same. In typical usage cases the majority of such
> groups have only 1 or a small number of members. The single line groups
> need not be sorted at all, just emitted, and the other groups would sort
> quickly. Memory requirements would typically be greatly reduced as well
> - a 100M line file might have a maximum group size of only 10.
>
> This was discussed previously in this thread:
>
> https://superuser.com/questions/1297157/optimally-sort-a-dataset-of-n-keys-when-first-key-is-already-ordered
>
> where external methods to chop up the input and spoon feed it to
> multiple runs of sort are shown. None of these are going to be as fast
> as it would be if sort itself handled this.
>
> Of course if the input data is not in the expected order that would be
> detected and a fatal error would occur, with the program exiting with an
> error status. A partial output file may be created in that case. If
> that partial file would be a problem the issue can be avoided by running
> sort with its (existing) "check" option using only the pre-sorted keys.
>
> Thank you,
>
> David Mathog
> address@hidden
> Manager, Sequence Analysis Facility, Biology Division, Caltech
I like this idea.
I'm not sure how frequent this partially sorted data pattern is,
though given the complexity improvements it seems worthwhile.
So 'p' would be a modifier applicable to each key.
When the data in the key changes we could do
the full key comparison to ensure the order is correct,
and if so output all current sorted data for the just presorted key.
'p' modified keys would need to come before any non 'p' keys.
cheers,
Pádraig