Re: [coreutils] Bug (?) in sort -R
From: Eric Blake
Subject: Re: [coreutils] Bug (?) in sort -R
Date: Mon, 16 Aug 2010 13:47:15 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.7) Gecko/20100720 Fedora/3.1.1-1.fc13 Lightning/1.0b2pre Mnenhy/0.8.3 Thunderbird/3.1.1
On 08/16/2010 01:22 PM, Jason wrote:
> I can't decide if this is a bug or not. Apologies if this has already been
> discussed; I am pretty new to the list. I'm using the latest git version,
> 8.5.136-6d78c.
>
> If you do
>
> sort -R -k 4,4 a > b
>
> the relative ordering of column 4 is different than if you do
>
> sort -R -k 4,5 a > b.
Thanks for the report.
First, remember that if you don't use -s, then sort adds an implicit
option of '-k 1' (that is, the entire line is treated as a tie-breaker),
which can affect results.
Also remember that without -b, the amount of whitespace preceding a
field is significant to some, but not all, of your sort fields, which
will impact the string subjected to the random hashing.
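Both effects are easier to see with an ordinary (non-random) sort, where the
result is deterministic. The following is only a sketch on made-up two-column
input (the sample lines and variable names are assumptions, not taken from the
report); the same key extraction rules decide which string -R hashes.

```shell
# 1. Last-resort comparison: both lines have the same key in field 2.
tie_broken=$(printf 'b 1\na 1\n' | LC_ALL=C sort -k 2,2)     # whole line breaks the tie
tie_kept=$(printf 'b 1\na 1\n' | LC_ALL=C sort -s -k 2,2)    # -s keeps the input order

# 2. Leading blanks: without -b, field 2 of the first line is "  b" (two
#    blanks), which sorts before " a" in the C locale; with -b the keys are
#    "b" and "a".
blanks_kept=$(printf 'x  b\ny a\n' | LC_ALL=C sort -s -k 2,2)
blanks_skipped=$(printf 'x  b\ny a\n' | LC_ALL=C sort -s -b -k 2,2)

printf '%s\n' "$tie_broken" "$tie_kept" "$blanks_kept" "$blanks_skipped"
```

The first pair differs only because of -s, and the second pair differs only
because of -b.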
>
> (obviously the actual order in the output file is different on every run
> unless you pass in the same random data to get the same ordering)
>
> It'd seem that the individual columns should be hashed and sorted
> independently in order to maintain the normal ordering of the primary sort
> column.
Nope, sort is based on the key, and if you request a key that spans two
columns, then you are hashing a different value than if you request a
key that spans one column. If you really want to hash the two columns
independently, then tell that to sort:
sort -s -R -k 4,4 -k 5,5 a
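As a rough check of that command, the sketch below feeds it the five sample
lines that appear in the debug output of this message and counts how many runs
of equal column-4 values come out. The helper name count_runs is an
assumption made for illustration. Equal keys hash to the same value, so with
independent keys the 'd' lines and the 'e' lines should each stay together; with
the spanning key 4,5 the number of runs varies from run to run.

```shell
# Sample input: the five lines used in the --debug examples.
data='a b c d e
a b c d f
a b c d g
a b c e e
a b c e f'

# Count runs of identical column-4 values (hypothetical helper).
count_runs() {
  awk '$4 != prev { runs++; prev = $4 } END { print runs + 0 }'
}

runs_independent=$(printf '%s\n' "$data" | LC_ALL=C sort -s -R -k 4,4 -k 5,5 | count_runs)
runs_spanning=$(printf '%s\n' "$data" | LC_ALL=C sort -s -R -k 4,5 | count_runs)

echo "independent keys: $runs_independent runs of column 4"
echo "spanning key:     $runs_spanning runs of column 4 (varies between runs)"
```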
> ~/coreutils/coreutils> src/sort --version
> sort (GNU coreutils) 8.5.136-6d78c
>
> This is also true if you use the -s flag with only one field specified,
> which is a slightly different flavor of the same bug.
>
> ~/coreutils/coreutils> src/sort -s -R -k 4 a
With only the start field of the key specified, you are asking sort to go
from that field to the end of the line. So 'sort -k 4' is different from
'sort -k 4,4', and hashes a different string.
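The difference between the two key forms can be shown without -R, so that the
output is reproducible. This is a minimal sketch on two of the sample lines;
the variable names are assumptions made for illustration.

```shell
data='a b c d g
a b c d e'

# -k 4 compares " d g" with " d e" (field 4 through end of line).
to_eol=$(printf '%s\n' "$data" | LC_ALL=C sort -s -k 4)

# -k 4,4 compares " d" with " d"; the keys tie and -s keeps the input order.
field_only=$(printf '%s\n' "$data" | LC_ALL=C sort -s -k 4,4)

printf '%s\n' "-k 4:" "$to_eol" "-k 4,4:" "$field_only"
```

With -k 4 the line ending in 'e' moves ahead of the one ending in 'g'; with
-k 4,4 the input order is preserved. Under -R the same two strings are what get
hashed.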
So far, I don't think you have managed to pinpoint any bugs in sort, but
only in your usage of it. The next version of coreutils will include
the --debug option to sort, to make analysis of your input a little
easier to follow:
$ sort --debug -R -k 4,4 a
sort: using `en_US.UTF-8' sorting rules
sort: leading blanks are significant in key 1; consider also specifying `b'
a b c e e
     __
_________
a b c e f
     __
_________
a b c d e
     __
_________
a b c d f
     __
_________
a b c d g
     __
_________
$ LC_ALL=C sort --debug -R -k 4,4 a
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
a b c d e
     __
_________
a b c d f
     __
_________
a b c d g
     __
_________
a b c e e
     __
_________
a b c e f
     __
_________
$ LC_ALL=C sort --debug -s -R -k 4 a
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
a b c e e
     ____
a b c d f
     ____
a b c d g
     ____
a b c d e
     ____
a b c e f
     ____
$ LC_ALL=C sort --debug -s -R -k 4,5 a
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
a b c e f
     ____
a b c e e
     ____
a b c d e
     ____
a b c d f
     ____
a b c d g
     ____
$ LC_ALL=C sort --debug -s -b -R -k 4,4 -k 5,5 a
sort: using simple byte comparison
a b c d g
      _
        _
a b c d e
      _
        _
a b c d f
      _
        _
a b c e e
      _
        _
a b c e f
      _
        _
--
Eric Blake address@hidden +1-801-349-2682
Libvirt virtualization library http://libvirt.org