bug#17188: Sort bugs


From: Eric Blake
Subject: bug#17188: Sort bugs
Date: Sat, 05 Apr 2014 06:21:33 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

tag 17188 notabug
thanks

On 04/04/2014 08:07 PM, Nikos Balkanas wrote:
> Hi,
> 
> Sort is seriously bugged. This is the output from:
> 
> sort -d -t \t -k1 input > out

-d says to do a dictionary sort, which considers only blanks and
alphanumeric characters and ignores everything else.  But it still
leaves it up to your current locale whether the remaining characters
are collated case-insensitively.

Also, '-k1' is almost always wrong - you generally want '-k1,1' if you
want to sort by JUST the first field, rather than by the whole line.

See the FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
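
A minimal sketch of the difference between '-k1' and '-k1,1', on made-up
three-line input (the sample data and variable names are illustrative
and not taken from the report).  It uses LC_ALL=C and -s so that lines
with equal keys keep their input order.  It also builds a real tab with
printf, since an unquoted \t reaches sort as the plain letter 't':

```shell
# Hypothetical sample data: two tab-separated fields per line.
tab=$(printf '\t')            # a literal tab; unquoted \t would be just 't'
input=$(printf 'b\t1\na\t2\na\t1\n')

# -k1 : the key runs from field 1 to the end of the line.
whole=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t "$tab" -k1)

# -k1,1 : the key is field 1 only; -s keeps tied lines in input order.
first=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t "$tab" -k1,1)

printf '%s\n' "-k1:" "$whole" "-k1,1:" "$first"
```

With '-k1' the two 'a' lines are ordered by their second field as well;
with '-k1,1' they tie on the key and stay in input order.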

> 
> 0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
> 000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
> 000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
> 00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
> 000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
> 
> Shouldn't 00/0 be first according to Ascii code?

Only if you are asking for a full ASCII sort.  Using --debug can help
show you where you are asking sort to do something different than you
expected, even though sort is behaving correctly given what you asked it
to do.  (I'm also adding -s below to keep the debug output shorter.)

I'm guessing your default locale is en_US.UTF-8 - because I get the same
results as you in that mode:

$ sort --debug -s -d -t \t -k1 input
sort: using ‘en_US.UTF-8’ sorting rules
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________

In this mode, '000p' collates case-insensitively before '000Q', so the
sort is correct (the collation was on '000Q' and not '00/0Q' because you
used -d).  Furthermore, if you omit -d:

$ sort --debug -s -t \t -k1 input
sort: using ‘en_US.UTF-8’ sorting rules
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________

No change, because the en_US.UTF-8 locale implicitly does a dictionary
collation even without you requesting -d.

Now, compare to the C locale, which forces sorting by byte value for
more traditional ASCII sorting:


$ LC_ALL=C sort --debug -s -d -t \t -k1 input
sort: using simple byte comparison
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________


'000Q' now sorts before '000R', which sorts before '000p', as expected.
Toss out the -d as well, and you get:

$ LC_ALL=C sort --debug -s -t \t -k1 input
sort: using simple byte comparison
00/0QwzaXrqGHXW7mE9Le8IIVgHoZvccgGydKdzJgh8.SZenbULmIWMtrGShz24W7T
__________________________________________________________________
0009rN2S3cohd2DGH6yuTWBoeuq6DwWZhCBDEnFzYqpw984FfALy7NUhEZH1.YEbiq/
___________________________________________________________________
000EMQeKUjtyXIOaUkT.XE6SaBIdOqTA0nffF394V6tkcVdup2c3ihi7yhbuRof2Y5agTG
______________________________________________________________________
000R2cnZ8.khe1eXDERclkbXASRQeKvcNBaCJRLX617Xvmff0KaoZSSFBNhNG1OiIyr
___________________________________________________________________
000p8kXIz5Tc1GaxYYXjAfgm7YJOZvyBJxVXMi0lhaJXT22IdDbE6vVhWXW9FkRBxQ
__________________________________________________________________


Now '00/' sorts before '000'.
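
The two C-locale results can be reproduced on a smaller scale.  This is
only a sketch: the five short strings are stand-ins for the first few
characters of each line in the report, and the variable names are
illustrative.

```shell
# Shortened stand-ins for the five input lines (hypothetical data).
input=$(printf '%s\n' '0009r' '000EM' '000p8' '00/0Q' '000R2')

# With -d, '/' is ignored, so '00/0Q' is compared as '000Q'.
with_d=$(printf '%s\n' "$input" | LC_ALL=C sort -d)

# Without -d, '/' (0x2F) sorts before '0' (0x30), so '00/0Q' comes first.
without_d=$(printf '%s\n' "$input" | LC_ALL=C sort)

printf '%s\n' "with -d:" "$with_d" "without -d:" "$without_d"
```

Only the position of '00/0Q' should differ between the two results.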

It might be a nice improvement to the --debug output to avoid putting _
under any character that sort ignored due to -d before calling strcoll()
(which would help the output of the LC_ALL=C case, but not the
en_US.UTF-8 case) - but that's probably difficult to implement.

> 
> Plz fix.

There's nothing to fix but your usage pattern.  So I'm closing this as
not a bug.  But feel free to reply further if you still have questions.


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
