bug-coreutils

bug#9740: Bug in sort


From: Eric Blake
Subject: bug#9740: Bug in sort
Date: Wed, 12 Oct 2011 13:02:30 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110928 Fedora/3.1.15-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.4 Thunderbird/3.1.15

tag 9740 notabug
thanks

On 10/12/2011 12:41 PM, Lluís Padró wrote:

I found a bug in the "sort" utility that happens under utf8 locales, though
no character beyond basic ascii is involved in it...

Thanks for the report; however, this is almost certainly a case of your locale defining a different collation order than what you were expecting. See the FAQ:
https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021


I'm using "sort (GNU coreutils) 7.4" from package
"coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS

The latest version of coreutils, 8.14, includes a --debug option that makes it even more apparent why sort is behaving correctly:

## Let's try another locale
~$ export LC_ALL="en_US.UTF-8"

## Sort fails. Shorter words are sorted after longer words with the same prefix.
~$ sort testfile
abcd Z
abce Z
abc Z
ab Z

$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug
sort: using `en_US.UTF-8' sorting rules
abcd Z
______
abce Z
______
abc Z
_____
ab Z
____

So, what exactly is sort comparing? The entire line, because you didn't specify any -k options to limit the comparison to fields. And how does it do the comparison? By strcoll("abcd Z", "abc Z"). And how does strcoll() behave in the en_US.UTF-8 locale? By dictionary collation: case and punctuation (including space) are ignored, at least on the first pass over the line. So strcoll("abcd Z", "abc Z") gives the same answer as strcoll("abcdz", "abcz") in that locale, and sure enough, d comes before z, so the sort is correct.
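
If it helps to see that equivalence without depending on which locales are installed, here is a rough sketch that builds the "ignore spaces and case" key by hand and then sorts it by byte value. The helper names sample_lines and first_pass_order are made up for this illustration, and the tr pipeline is only an approximation of what the locale's collation tables do, not how strcoll() is implemented:

```shell
# Hypothetical helper: prints the four sample lines from the report.
sample_lines() { printf 'abc Z\nab Z\nabcd Z\nabce Z\n'; }

# Rough stand-in for the first-pass key: delete spaces, fold to lower case,
# then sort by byte value so no locale rules are involved.
first_pass_order() {
  sample_lines | tr -d ' ' | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort
}

first_pass_order
```

This prints abcdz, abcez, abcz, abz, one per line, which is the same relative order that sort produced for the original lines under en_US.UTF-8.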

You already figured out that LC_ALL=C forces sorting to honor byte values. But if you insist on using en_US collation, then maybe you should also look at forcing the sort to honor specific fields:

$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug -sb -k1,1 -k2,2
sort: using `en_US.UTF-8' sorting rules
ab Z
__
   _
abc Z
___
    _
abcd Z
____
     _
abce Z
____
     _
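
For comparison, a minimal sketch of the byte-value route on the same input, first on whole lines and then with the same field keys as above. The helper name sample_lines is only for this illustration:

```shell
# Hypothetical helper: prints the four sample lines from the report.
sample_lines() { printf 'abc Z\nab Z\nabcd Z\nabce Z\n'; }

# Whole-line comparison by byte value: the space (0x20) sorts before
# every letter, so shorter words come first.
sample_lines | LC_ALL=C sort

# Same field-limited keys as above, but under byte-value rules.
sample_lines | LC_ALL=C sort -s -b -k1,1 -k2,2
```

Both commands print ab Z, abc Z, abcd Z, abce Z in that order for this input. The difference between them only shows up on data where the second field or embedded punctuation would change a whole-line comparison.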


--
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org




