bug#40226: sort: expected sort order when -c in use
From: Eric Blake
Subject: bug#40226: sort: expected sort order when -c in use
Date: Wed, 25 Mar 2020 13:17:19 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0
On 3/25/20 12:37 PM, Richard Ipsum wrote:
> Hi,
> I'm trying to understand something and thought it would be good to ask
> here.
> I get different results for a case-insensitive sort using -c. My
> understanding is that -f should lead to lower case characters with upper
> case equivalents being converted to their upper case equivalents. This
> doesn't seem to be happening for the C locale though.
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_GB.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_US.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=C sort -c -f -
> sort: -:2: disorder: AAAA
First, 'echo -e' is not portable, so I'll be reproducing your example
with printf. And you are assuming that LC_ALL is not set (otherwise,
LC_COLLATE would have no impact); so I'll set LC_ALL to be sure. Except
that I can't reproduce your example (I'm using Fedora 31, coreutils 8.31):
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f -
sort: -:2: disorder: AAAA
So there's probably something different in the locale libraries and/or
your coreutils version on your system, compared to mine.
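As a quick illustration of the LC_ALL point, here is a minimal sketch. The locale name en_US.UTF-8 is only an example and may not be installed on a given system; the outcome should be the same either way, because LC_ALL=C takes precedence over LC_COLLATE.

```shell
# LC_ALL, when set, overrides every LC_* category, including LC_COLLATE.
# LC_COLLATE asks for en_US.UTF-8 here (an example name only), but
# LC_ALL=C wins, so sort compares raw bytes and every uppercase ASCII
# letter sorts before every lowercase one.
overridden=$(printf 'b\nA\na\nB\n' | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort)
printf '%s\n' "$overridden"
```

Some shells may print a locale warning on stderr if the example locale is missing; that does not change what sort writes to stdout.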
Next, let's debug things to see why:
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f - --debug
sort: options '-c --debug' are incompatible
Oh, bummer - I don't know why we have that restriction. Okay, let's try
a slightly different approach:
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
AAAA
____
____
aaaa
____
____
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug -s
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
aaaa
____
AAAA
____
See the difference? In the first case, sort does a case-insensitive
comparison of the entire line (because you passed -f but not -k), AND,
since that comparison is a tie, a last-resort comparison of the entire
line with no ordering options applied (as shown by the two ____ lines
per input). In the second case, adding -s omits the last-resort
comparison. The two lines do differ under the last-resort comparison,
which explains why -c choked when -s is absent. Put another way, -f
affects only key comparisons, including the implied whole-line key when
you don't specify -k, and not the last-resort comparison that -s
disables. Now that we know that, let's return to your example:
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - -c -s
$ echo $?
0
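To make the effect reproducible without depending on which UTF-8 locales are installed, here is a small sketch pinned to the C locale. It assumes GNU coreutils sort; the helper name check is made up for illustration. It runs sort -c on the same two lines with and without -s and reports the exit status, showing that the tie-breaking whole-line comparison (the "last-resort comparison" in the GNU sort manual) is what triggers the disorder report.

```shell
# Hypothetical helper: run "sort -c" on the two-line sample with the
# given extra options and print the exit status
# (0 = input is in order, 1 = disorder was found).
check() {
    printf 'aaaa\nAAAA\n' | LC_ALL=C sort -c "$@" 2>/dev/null
    echo $?
}

# With -f alone, both keys fold to the same text, and the tie is then
# broken by comparing the unfolded lines, where "aaaa" sorts after "AAAA".
without_s=$(check -f)
# With -s added, the tie-breaking comparison is skipped, so the tie stands.
with_s=$(check -f -s)

echo "sort -c -f:    exit $without_s"
echo "sort -c -f -s: exit $with_s"
```

With GNU sort, the first check should report 1 and the second 0. So if the goal is "is this input sorted, ignoring case?", passing -s along with -c -f is probably what you want.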
> Is this considered a bug or an expected difference between the locales?
I don't know if it's the locale definition, or something changed between
coreutils versions, or both; although I'm more likely to chalk it up to
locale issues and not something where coreutils needs a patch, other
than perhaps a documentation patch. I'll leave the bug report itself
open for a bit longer, in case anyone else has an opinion.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org