bug-coreutils

bug#33371: RFC: option for numeric sort: ignore-non-numeric characters


From: Erik Auerswald
Subject: bug#33371: RFC: option for numeric sort: ignore-non-numeric characters
Date: Mon, 19 Nov 2018 15:27:00 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1

Hi,

On 11/19/18 02:08, L A Walsh wrote:
On 11/14/2018 12:27 AM, Erik Auerswald wrote:
On Tue, Nov 13, 2018 at 06:32:55PM -0800, L A Walsh wrote:
I have a bunch of files numbered from 1 to over 2000 without leading zeros
(think rfc's)...
They have names with a non-numeric prefix & suffix around the number.

Are prefix and suffix constant? RFC files are usually named rfc${NR}.txt.

It would be nice if sort had the option to ignore non-numeric
data and only sort on the numeric data in the 'lines'/'files'.

Perhaps --version-sort could work for you?
[...]
'sort -V' by itself works.
[...]
Or is there an option for this already, and my manpage is out of date?

AFAIK not exactly.
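
Without a dedicated option, one possible workaround is to decorate each line with its number, sort on that, and strip the decoration again. The following is only a rough sketch: the function name sort_by_first_number and the rfc*.txt sample names are made up for illustration, and it assumes that only the first run of digits in each line matters.

```shell
# Hypothetical helper: sort lines by the first run of digits they contain.
sort_by_first_number() {
    # Decorate: prefix each line with its first run of digits and a space.
    sed 's/^[^0-9]*\([0-9]*\).*$/\1 &/' |
        # Sort numerically on the decoration only.
        sort -k1,1n |
        # Undecorate: drop the prefix, keep the original line.
        cut -d' ' -f2-
}

# Made-up sample input resembling RFC file names.
printf 'rfc10.txt\nrfc9.txt\nrfc2000.txt\n' | sort_by_first_number
```

Lines without any digits get an empty key, which a numeric sort treats as zero, so they end up before the numbered lines.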
[...]
     "-V" seems like it might be sufficient, but I doubt most
non-computer types would know that -V would sort multiple numeric fields
separated by invariant non-numeric characters in a numeric fashion
(or would even know how a version sort differs from the other sorts).

As far as I remember, the definition of --version-sort is to follow the Debian GNU/Linux package version sorting rules. Those are based on numbers surrounded by text, but several characters have special meaning (e.g. '~' sorts before everything else, even before the empty string). Thus this is _not_ a "natural sort," but quite specific and potentially surprising.

$ printf -- 'foo\nbar\nfoo-bar\nfoo~bar\n' | sort --version-sort
bar
foo~bar
foo
foo-bar

Given how well read docs are these days, almost need a literal definition
of 'version sort' besides just calling it a 'version sort' (which we
must admit, is 'jargon').

I think it is worse than jargon, because it is specific to one kind of version numbering scheme.

Along the lines of:

   --version-sort | -V
       Sees inputs as a mix of numeric and alphabetic (or "identifier")
       fields, where the numeric fields are sorted naturally, and alpha
       fields sorted alphabetically.  Fields may have separators like
       '.', '_', or '-', sometimes constrained by a specific computer
       language, or may have no separator at all between numeric and
       alpha fields.  This type of sort is often called a "version
       sort" in the computer field.

Thus I am not sure about your suggestion above. :-/

???  I listed 'version sort' at the end, as the equivalence so those who tend to skip and read initial parts of lines/paragraphs would not just see "version sort" and gloss over the rest, inserting their own equivalence for the definition -- especially likely w/"version-sort" being the long form
of the switch.

I like that strategy. :-)

Thanks,
Erik




