Re: BUG in sort --numeric-sort --unique
From: Kaz Kylheku (Coreutils)
Subject: Re: BUG in sort --numeric-sort --unique
Date: Thu, 13 Feb 2020 15:32:35 -0800
User-agent: Roundcube Webmail/0.9.2
On 2020-02-13 14:00, Stefano Pederzani wrote:
> In fact, separating the parameters:
>
> # cat controllareARCHIVIO_2020/02/controllare20200213.txt | sort -u | sort -n | wc -l
> 1262
>
> we work around the bug.
My own experiment confirms that this behavior is reasonable.
When -n and -u are combined, uniqueness is based on numeric
equivalence. Since numeric equivalence is weaker than textual
equivalence, de-duplication based on numeric equivalence can cull
more records than de-duplication based on textual equivalence.
$ printf "0\n00\n000\n" | sort -u
0
00
000
$ printf "0\n00\n000\n" | sort -n
0
00
000
$ printf "0\n00\n000\n" | sort -nu
0
$ printf "0\n00\n000\n" | sort -n | sort -u
0
00
000
$ printf "0\n00\n000\n" | sort -u | sort -n
0
00
000
As you can see, sort -nu is not equivalent to any combination
of sort -n and sort -u. sort -nu has de-duplicated a file of
different "spellings" of zero down to a single entry.
sort -u does not de-duplicate these entries, because "0"
is textually different from "00".
> Every line is only something like "1.2.3.4".
Unfortunately, "sort -n" will probably not do what you think with
this data.
Please read sort's GNU Info documentation; the man page lacks
detail about what numeric sorting means.
Also see the POSIX standard's description of -n:
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html
In short, what -n does is recognize a *prefix* of each line as a number
according to a pattern that includes optional blanks, an optional sign,
digits, a radix character, and digit group separators.
-n does not deal with compound numeric identifiers like 1.2.3.4.
Basically 1.2.3.4 and 1.2.4.4 both look like the number 1.2.
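The prefix parsing is easy to demonstrate with made-up input (these
lines are my own illustration, not from the original report). Only the
leading numeric prefix of each line is used as the key; the trailing
letters play no part in the numeric comparison:

```shell
# sort -n extracts a numeric prefix from each line: "10x" keys as 10,
# "9y" as 9, "2z" as 2, so the lines order as 2, 9, 10.
printf '10x\n9y\n2z\n' | sort -n
# Output:
# 2z
# 9y
# 10x
```

A plain `sort` would instead order these lines textually as 10x, 2z, 9y.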
$ sort -nu
1.2.3.4
1.2.4.4
1.2.5.6
[Ctrl-D][Enter]
1.2.3.4
Oops! Yet this result is correct: under numeric sort (-n), all these
lines are considered to have the key 1.2. If we de-duplicate based on
that key, they are all duplicates of one another, so they collapse
down to a single line.
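For what it's worth, one way to handle dotted compound identifiers like
these (this is my suggestion, not something from the original exchange)
is to split on the dot and give each field its own numeric key:

```shell
# Sketch: treat each dot-separated field as a separate numeric key,
# so 1.2.3.4 and 1.2.10.4 compare field by field (3 < 10), and -u
# de-duplicates only lines whose four fields are all numerically equal.
printf '1.2.10.4\n1.2.3.4\n1.2.3.4\n' |
  sort -t . -k1,1n -k2,2n -k3,3n -k4,4n -u
# Output:
# 1.2.3.4
# 1.2.10.4
```

GNU sort also offers -V (version sort), which orders embedded numeric
runs naturally and may be a simpler fit for data of this shape.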