bug-coreutils

bug#12295: sort 5.97 vs. LANG (RHEL5)


From: Lubos Kaspar
Subject: bug#12295: sort 5.97 vs. LANG (RHEL5)
Date: Tue, 28 Aug 2012 17:24:09 +0200

Dear GNU Coreutils Developers,

I may have found a bug in GNU sort 5.97 as shipped in RHEL 5.8:

: $ cat /etc/redhat-release; uname -sr
: Red Hat Enterprise Linux Client release 5.8 (Tikanga)
: Linux 2.6.18-308.1.1.el5

: $ sort --version
: sort (GNU coreutils) 5.97
: Copyright (C) 2006 Free Software Foundation, Inc.
: This is free software.  You may redistribute copies of it under the terms of
: the GNU General Public License <http://www.gnu.org/licenses/gpl.html>.
: There is NO WARRANTY, to the extent permitted by law.
: 
: Written by Mike Haertel and Paul Eggert.

: $ man sort|grep bug
:        Report bugs to <address@hidden>.

It occurs with LANG=cs_CZ.iso88592 (also verified with
LANG=cs_CZ.utf8, LANG=de_DE.iso88591 and plain LANG=en_US),
even when the input contains only US-ASCII characters.

Here is a very simple example that sorts surnames joined by a
hyphen (minus) to the corresponding first-name initials
('Novak' is the most common Czech surname and 'Novakova' is
the modified form used for women):

: $ cat x                       #original content, in reverse of the wanted order
: Novakova-V
: Novak-P
: Novak-L
: Novak-J

: $ LANG= sort x                #sort it without LANG setting (expected result)
: Novak-J
: Novak-L
: Novak-P
: Novakova-V

: $ LANG=C sort x               #sort it with LANG=C setting (expected result)
: Novak-J
: Novak-L
: Novak-P
: Novakova-V
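
The commands in this report set only LANG, which governs collation only
when LC_ALL and LC_COLLATE are unset or empty. A minimal sketch for
checking that, assuming the usual POSIX precedence; the function name
effective_collate is hypothetical and not part of coreutils:

```shell
# Hypothetical helper: print the locale name that governs collation,
# following the POSIX precedence LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Nothing set: assumed to fall back to the C locale, as glibc does.
        printf '%s\n' C
    fi
}

effective_collate
```

If this prints something other than the intended LANG value, then
LC_ALL or LC_COLLATE is overriding it.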

: $ LANG=cs_CZ.iso88592 sort x  #sort it using usual locale (odd result)
: Novak-J
: Novak-L
: Novakova-V
: Novak-P

The same results occur when a dot is used as the separator instead of
the hyphen. The options -d, -f and -s, alone or combined, make no
difference.
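
One possible reading of the odd order, offered as an assumption rather
than a diagnosis: if the locale's collation rules skip punctuation and
fold case on the first comparison pass, then 'Novak-P' is compared as
'novakp' and 'Novakova-V' as 'novakovav', and 'o' sorts before 'p'.
The sketch below emulates only that assumed rule in the C locale; it
does not use the real cs_CZ collation tables, and the function name
emulate_ignore_punct is hypothetical:

```shell
# Hypothetical helper: order stdin as if punctuation were ignored and
# case folded. Each line is decorated with a key (lowercased, with
# everything except ASCII letters and digits removed), sorted bytewise
# by that key, then the key is dropped again.
emulate_ignore_punct() {
    LC_ALL=C awk '{ key = tolower($0); gsub(/[^a-z0-9]/, "", key); print key "\t" $0 }' |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1 |
        cut -f2-
}

printf '%s\n' Novakova-V Novak-P Novak-L Novak-J | emulate_ignore_punct
```

With the four sample lines this prints Novak-J, Novak-L, Novakova-V,
Novak-P, i.e. the same order as the cs_CZ run, and it gives the same
relative order when the hyphen is replaced by a dot.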

Of course, it could quite easily be worked around in this case, e.g.:

: $ LANG=cs_CZ.iso88592 sort -t- -k1,1 -k2,2 x
: Novak-J
: Novak-L
: Novak-P
: Novakova-V

but this is probably not possible in general.
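
For the narrow case of exactly two fields joined by a single-character
separator, the key-based workaround can be wrapped so that the
separator is a parameter. This is only a sketch under that assumption;
the function name sort_two_fields is hypothetical:

```shell
# Hypothetical helper: sort stdin on the text before the separator first,
# then on the text after it. $1 must be a single character, because
# sort -t accepts only one.
sort_two_fields() {
    sort -t "$1" -k1,1 -k2,2
}

printf '%s\n' Novakova.V Novak.P Novak.L Novak.J | sort_two_fields .
```

Because each key is compared on its own, 'Novak' is ordered before
'Novakova' regardless of how the locale treats the separator. Lines
with more than one separator, or with no fixed separator at all, are
not covered.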

Unfortunately, it is also not generally possible to use LANG= or LANG=C,
because some data sets require sorting that respects local conventions
(e.g. ranking 'ch' between 'h' and 'i' rather than between 'cg' and 'ci',
or placing consonants with carons after those without), which should
work simply by using LANG=cs_CZ.

If it is not a bug, I would be grateful for an explanation and some
advice on how to use 'sort' to get regular results.
In that case, please accept my apologies for disturbing you.

Thank you very much for your attention and understanding.

Best regards,
--
Lubos Kaspar




