--- Begin Message ---
Subject: Sorting bug?
Date: Tue, 23 Sep 2014 22:24:45 +0200
I discovered a behaviour of "sort" that looks like a bug to me. When
one key in the input is an initial part of another key, the shorter
key is sorted first if the key is all there is on the line. But if
there are other fields too, not included in the key, the order
changes. That is true even with the --stable flag, so "sort" seems to
consider the order of the keys different in the two cases.
I sort in a non-C locale. sv_SE.utf8 actually, but en_US.utf8 behaves
the same so I illustrate using that.
First case, the key is all there is on the line. The shorter line
gets sorted earlier, regardless of input order:
address@hidden Hämtat]$ { echo 'binutils x86_64'; echo 'binutils-x86_64-linux-gnu x86_64'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!
sort: using ‘en_US.utf8’ sorting rules
binutils x86_64
_______________
binutils-x86_64-linux-gnu x86_64
________________________________
address@hidden Hämtat]$ { echo 'binutils-x86_64-linux-gnu x86_64'; echo 'binutils x86_64'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!
sort: using ‘en_US.utf8’ sorting rules
binutils x86_64
_______________
binutils-x86_64-linux-gnu x86_64
________________________________
Second case, the input lines contain a second field. Now the longer
field gets sorted earlier, regardless of input order:
address@hidden Hämtat]$ { echo 'binutils x86_64!new'; echo 'binutils-x86_64-linux-gnu x86_64!new'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!
sort: using ‘en_US.utf8’ sorting rules
binutils-x86_64-linux-gnu x86_64!new
________________________________
binutils x86_64!new
_______________
address@hidden Hämtat]$ { echo 'binutils-x86_64-linux-gnu x86_64!new'; echo 'binutils x86_64!new'; } | LANG=en_US.utf8 sort --stable --debug --key=1,1 --field-separator=!
sort: using ‘en_US.utf8’ sorting rules
binutils-x86_64-linux-gnu x86_64!new
________________________________
binutils x86_64!new
_______________
I can't see any reason for this. Is it me not understanding sorting,
or is it actually a bug?
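
One way to cross-check the expected order independently of "sort" is to do the same key-only comparison through the C library's collation from a small script. The following is only a sketch: the function name sort_by_first_field is made up for illustration, it assumes the en_US.utf8 locale is installed (falling back to the C locale otherwise, which gives plain byte order rather than locale order), and it only models --key=1,1 with a single-character separator.

```python
import locale


def sort_by_first_field(lines, sep="!", locale_name="en_US.utf8"):
    """Sort lines by the text before the first separator, using libc collation.

    Hypothetical helper for illustration; it is not part of coreutils.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Assumption: if the locale is not installed, fall back to the C
        # locale, which compares by byte value instead of locale rules.
        locale.setlocale(locale.LC_COLLATE, "C")
    # strxfrm() turns a string into a form whose plain comparison matches
    # strcoll(); only the first field is passed in, never the whole line.
    return sorted(lines, key=lambda line: locale.strxfrm(line.split(sep, 1)[0]))


lines = ["binutils-x86_64-linux-gnu x86_64!new", "binutils x86_64!new"]
for line in sort_by_first_field(lines):
    print(line)
```

If this prints the lines in a different order than "sort --key=1,1 --field-separator=!" does in the same locale, that would suggest "sort" is comparing something other than the first field.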
--- End Message ---
--- Begin Message ---
Subject: Re: bug#18540: Sorting bug?
Date: Tue, 23 Sep 2014 15:36:53 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.0
tag 18540 notabug
thanks
On 09/23/2014 02:58 PM, Eric Blake wrote:
> Let's look further:
>
> $ printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 ltrace -e strcoll sort -s --debug -k1,1 -t!
> sort: using ‘en_US.utf8’ sorting rules
> sort->strcoll("a b!x", "a-b-c!x") = 21
> a-b-c!x
> _____
> a b!x
> ___
> +++ exited (status 0) +++
Hmm, I just noticed something.
>
>
> Huh? Why are we passing the ENTIRE line to strcoll? Shouldn't we only
> be passing the key?
That was my distro's build of sort (in my case, Fedora 20, with sort
from GNU coreutils 8.21). But looking at coreutils.git (v8.23-39-g1ff4d08),
$ printf 'a b!x\na-b-c!x\n' | LANG=en_US.utf8 ltrace -e strcoll ./src/sort -s --debug -k1,1 -t!
./src/sort: using ‘en_US.utf8’ sorting rules
sort->strcoll("a b", "a-b-c") = -1
a b!x
___
a-b-c!x
_____
+++ exited (status 0) +++
Yay - strcoll now uses the correct bounds. Next step - determining if
this is an upstream problem that was fixed in the interim, or if this is
a bug in the downstream additions on top of stock upstream. None of the
9 commits in 'git shortlog v8.21.. src/sort.c' seem to describe the
situation.
And looking at my distro's patches, there is definitely some gorp added
to sort.c in coreutils-i18n.patch, which I highly suspect to be the root
cause.
So please re-raise this as a downstream bug in your distro's i18n patch,
as upstream coreutils is immune.
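
To see why passing the whole line instead of the key can flip the order, here is a rough model of the two comparisons. It is a sketch under a loud assumption: the collation is simplified to "ignore everything that is not a letter or digit", which is NOT the real en_US.utf8 rule set, just enough to mimic how spaces and punctuation carry little weight there. The names primary_weight and model_sort are invented for this illustration.

```python
def primary_weight(text):
    # Assumption: a crude stand-in for locale collation that drops
    # spaces and punctuation and compares only letters and digits.
    return "".join(ch for ch in text if ch.isalnum())


def model_sort(lines, sep="!", whole_line=False):
    """Sort by the first field (correct) or by the whole line (buggy)."""
    def sort_key(line):
        compared = line if whole_line else line.split(sep, 1)[0]
        return primary_weight(compared)
    # sorted() is stable, like sort -s: equal keys keep input order.
    return sorted(lines, key=sort_key)


lines = ["a b!x", "a-b-c!x"]
print(model_sort(lines))                   # ['a b!x', 'a-b-c!x']
print(model_sort(lines, whole_line=True))  # ['a-b-c!x', 'a b!x']
```

With only the key, "ab" is a prefix of "abc" and sorts first. With the whole line, the model compares "abx" against "abcx", and the "x" from the second field loses to "c", so the longer key wins. That matches the direction of both ltrace results above, though the real collation tables are considerably more involved.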
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
--- End Message ---