bug#12295: sort 5.97 vs. LANG (RHEL5)


From: Pádraig Brady
Subject: bug#12295: sort 5.97 vs. LANG (RHEL5)
Date: Tue, 28 Aug 2012 23:57:32 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0

tag 12295 + notabug
close 12295
stop

more info below...

On 08/28/2012 04:24 PM, Lubos Kaspar wrote:
> Dear GNU Coreutils Developers,
> 
> I may have found a bug in GNU sort 5.97 as ported into RHEL 5.8:
> 
> : $ cat /etc/redhat-release; uname -sr
> : Red Hat Enterprise Linux Client release 5.8 (Tikanga)
> : Linux 2.6.18-308.1.1.el5
> 
> : $ sort --version
> : sort (GNU coreutils) 5.97
> : Copyright (C) 2006 Free Software Foundation, Inc.
> : This is free software.  You may redistribute copies of it under the terms of
> : the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
> : There is NO WARRANTY, to the extent permitted by law.
> : 
> : Written by Mike Haertel and Paul Eggert.
> 
> : $ man sort|grep bug
> :        Report bugs to <address@hidden>.
> 
> It comes using LANG=cs_CZ.iso88592 (and verified also for
> LANG=cs_CZ.utf8 and e.g. for LANG=de_DE.iso88591 or for
> simple LANG=en_US, too) even when using only US-ASCII characters.
> 
> Let me give you a very simple example when sorting some surnames
> concatenated by a minus (hyphen) with related first name initials
> ('Novak' is the most common Czech surname and 'Novakova' is
> a modified form used for women):
> 
> : $ cat x                     #content origin in reverse order than wanted
> : Novakova-V
> : Novak-P
> : Novak-L
> : Novak-J
> 
> : $ LANG= sort x              #sort it without LANG setting (expected result)
> : Novak-J
> : Novak-L
> : Novak-P
> : Novakova-V
> 
> : $ LANG=C sort x             #sort it with LANG=C setting (expected result)
> : Novak-J
> : Novak-L
> : Novak-P
> : Novakova-V
> 
> : $ LANG=cs_CZ.iso88592 sort x        #sort it using usual locale (odd result)
> : Novak-J
> : Novak-L
> : Novakova-V
> : Novak-P
> 
> The same results can be obtained e.g. for using dot as a separator
> instead of minus (hyphen). No matter using -d and/or -f and/or -s, too.
> 
> Of course, it could be quite easily 'workarounded' in this case, e.g.:
> 
> : $ LANG=cs_CZ.iso88592 sort -t- -k1,1 -k2,2 x
> : Novak-J
> : Novak-L
> : Novak-P
> : Novakova-V
> 
> but it is probably impossible to do it commonly.
> 
> Unfortunately it is also generally impossible to use LANG= or LANG=C
> as some sets of data require proper sorting respecting local traditions
> (e.g. to rank 'ch' between 'h' and 'i', not between 'cg' and 'ci',
> consonants with carons after those without carons etc.) which should
> work just using LANG=cs_CZ.
> 
> If it is not a bug it would be very kind of you to send me some
> explanation and an advice how to use 'sort' to get regular results.
> In such a case please accept my deep apologies for disturbing you.
> 
> Thank you very much for your attention and understanding.

Thanks for the detailed report.
However, this just seems like a case of:
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
Quoting from there:

"Most of the language specific locales have tables that specify the sort
behavior to ignore punctuation and to fold case. This is counter intuitive
to most long time computer users!"

This minimal reproducer shows the same
behaviour in the en_US locale:

$ printf '%s\n' xo-V x-P x-L x-J | LC_ALL=en_US sort
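For contrast, here is a minimal sketch of the same input under the C locale,
where lines are compared by byte value, so '-' (0x2D) sorts ahead of 'o'
(0x6F) and punctuation is not ignored. The variable name c_sorted is only
for illustration:

```shell
# Same four lines as the reproducer, but compared byte by byte (C locale).
c_sorted=$(printf '%s\n' xo-V x-P x-L x-J | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# prints:
# x-J
# x-L
# x-P
# xo-V
```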

Yes, this is daft default behavior, and your workaround
seems like the best option for now.

Perhaps in future we will be able to support more
fine-grained control over the sorting order.
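
As a sketch of why the field-based workaround helps: with -t- the hyphen
becomes a field separator rather than part of the compared text, so the
surname and the initial are each collated on their own. The sample data
below is copied from the report; the variable names are hypothetical, and
the collation locale is whatever the environment provides:

```shell
# Sample data from the report, in the original (unsorted) order.
names='Novakova-V
Novak-P
Novak-L
Novak-J'

# Split on '-', compare field 1 (surname) first, then field 2 (initial).
# No locale is forced here; sort inherits it from the environment.
field_sorted=$(printf '%s\n' "$names" | sort -t- -k1,1 -k2,2)
printf '%s\n' "$field_sorted"
```

This only applies when the separator is known in advance, which matches
the reporter's caveat that it cannot be done in general.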

cheers,
Pádraig.
