coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Request for feature, sort


From: David Mathog
Subject: Request for feature, sort
Date: Thu, 01 Nov 2018 15:28:01 -0700
User-agent: Roundcube Webmail/0.9.5

Greetings,

I would like to request a "pre-sorted" (p?) modifier for keys. I frequently sort very large files with one or two of the sort keys already in the correct order as a consequence of how those files were generated. Sort currently has no way of accepting this information so that

  sort -k 1,1n -k 2,2 -k 3,3 <infile >outfile

pulls the entire file into a buffer and sorts it. That is slow, takes up a lot of memory, and is mostly a waste of CPU cycles. The suggested syntax of:

  sort -k 1,1np -k 2,2p -k 3,3 <infile >outfile

would only sort those small groups of lines where indicated presorted keys are all the same. In typical usage cases the majority of such groups have only 1 or a small number of members. The single line groups need not be sorted at all, just emitted, and the other groups would sort quickly. Memory requirements would typically be greatly reduced as well - a 100M line file might have a maximum group size of only 10.

This was discussed previously in this thread:

https://superuser.com/questions/1297157/optimally-sort-a-dataset-of-n-keys-when-first-key-is-already-ordered

where external methods to chop up the input and spoon feed it to multiple runs of sort are shown. None of these are going to be as fast as it would be if sort itself handled this.

Of course if the input data is not in the expected order that would be detected and a fatal error would occur, with the program exiting with an error status. A partial output file may be created in that case. If that partial file would be a problem the issue can be avoided by running sort with its (existing) "check" option using only the pre-sorted keys.

Thank you,

David Mathog
address@hidden
Manager, Sequence Analysis Facility, Biology Division, Caltech



reply via email to

[Prev in Thread] Current Thread [Next in Thread]