bug#40226: sort: expected sort order when -c in use


From: Eric Blake
Subject: bug#40226: sort: expected sort order when -c in use
Date: Wed, 25 Mar 2020 13:17:19 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0

On 3/25/20 12:37 PM, Richard Ipsum wrote:
> Hi,
>
> I'm trying to understand something and thought it would be good to ask
> here.
>
> I get different results for a case-insensitive sort using -c. My
> understanding is that -f should lead to lower case characters with upper
> case equivalents being converted to their upper case equivalents. This
> doesn't seem to be happening for the C locale though.
>
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_GB.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_US.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=C sort -c -f -
> sort: -:2: disorder: AAAA

First, 'echo -e' is not portable, so I'll reproduce your example with printf. You are also assuming that LC_ALL is not set (otherwise LC_COLLATE would have no effect), so I'll set LC_ALL explicitly to be sure. Even so, I can't reproduce your result (I'm using Fedora 31, coreutils 8.31):

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f -
sort: -:2: disorder: AAAA

So there's probably something different in the locale libraries and/or your coreutils version on your system, compared to mine.
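
As a side illustration of the LC_ALL point above, the following is a minimal sketch, assuming a POSIX shell and a sort that honors the standard locale variables. The variable name out is only for illustration. It uses the C locale so that the result does not depend on which locales are installed:

```shell
# LC_ALL takes precedence over LC_COLLATE, so this sorts with C rules
# (uppercase bytes before lowercase) no matter what LC_COLLATE names.
out=$(printf 'aaaa\nAAAA\n' | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort)
printf '%s\n' "$out"
```

If LC_ALL were left unset, the LC_COLLATE value would be the one in effect, which is the situation the original report relies on.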

Next, let's debug things to see why:

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f - --debug
sort: options '-c --debug' are incompatible

Oh, bummer - I don't know why we have that restriction. Okay, let's try a slightly different approach:

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
AAAA
____
____
aaaa
____
____
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug -s
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
aaaa
____
AAAA
____

See the difference? In the first case, sort does a case-insensitive comparison of the entire line (because you passed -f but not -k), AND then a last-resort comparison of the entire line as if -f had not been given (as shown by the two ____ lines per input). In the second case, adding -s omits that last-resort comparison. The two lines do differ under the last-resort comparison, which explains why -c choked when -s was absent. Put another way, -f affects only key comparisons, including the implied whole-line key used when you don't specify -k, and not the last-resort comparison that -s disables. Now that we know that, let's return to your example:

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - -c -s
$ echo $?
0
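
The same effect can be checked in the C locale, where the outcome does not depend on installed locale data. This is only a sketch, assuming GNU sort (whose -c exits with status 1 on disorder); the variable names status_default and status_stable are made up for the illustration:

```shell
# Assumes GNU sort. In the C locale the last-resort comparison is a plain
# byte comparison, so the result does not depend on installed locale data.

# Without -s: the -f keys compare equal, then the last-resort comparison
# sees 'aaaa' > 'AAAA' and -c reports disorder (non-zero exit status).
status_default=0
printf 'aaaa\nAAAA\n' | LC_ALL=C sort -c -f - 2>/dev/null || status_default=$?

# With -s: the last-resort comparison is skipped, so the input counts as sorted.
status_stable=0
printf 'aaaa\nAAAA\n' | LC_ALL=C sort -c -f -s - || status_stable=$?

echo "without -s: $status_default, with -s: $status_stable"
```

The first check corresponds to the "disorder: AAAA" message in the original report for LC_COLLATE=C; the second shows that -s is enough to make -c accept the same input.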



> Is this considered a bug or an expected difference between the locales?

I don't know whether it's the locale definition, a change between coreutils versions, or both. I'm more inclined to chalk it up to locale differences than to anything that needs a coreutils patch, other than perhaps a documentation patch. I'll leave the bug report open for a bit longer, in case anyone else has an opinion.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





