coreutils

Re: question about behavior of sort -n -t,


From: Eric Blake
Subject: Re: question about behavior of sort -n -t,
Date: Tue, 08 Oct 2013 16:28:16 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130923 Thunderbird/17.0.9

On 10/08/2013 03:18 PM, Gabriel Gaster wrote:
> Hello all,
> 
> I have a question about the behavior of sort -n.
> 
> The premise of the question I asked on stackoverflow here 
> (http://stackoverflow.com/questions/19228968/unix-sort-n-t-gives-unexpected-result)
>  

Rather than making us chase a link, you could have posted an actual
example here:

$ cat example.csv  # here's a small example
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

$ cat example.csv | sort -n --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

By the way, if you use 'sort --debug', you'll learn a lot more about
what sort is actually doing:

$ cat <<\EOF | LC_ALL=C sort -n --debug --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
EOF

sort: using simple byte comparison
12,13.080339685
__
_______________
12,14.1531049905
__
________________
12,26.7613447051
__
________________
12,50.4592437035
__
________________
58,1.49270399401
__
________________
59,0.000192136419373
__
____________________
59,0.00182092924724
__
___________________
59,1.49270399401
__
________________
60,0.00182092924724
__
___________________
60,1.49270399401
__
________________


In the C locale, a numeric sort stops at the first non-numeric
character, and since the C locale has no thousands separator,
it stops at the comma.
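
A quick way to check this on your own machine is to ask the locale
database directly. This is only a sketch: it assumes the POSIX `locale`
utility is installed, and the variable name c_sep is made up for the
example.

```shell
# Ask the locale database for the thousands separator in the C locale.
# c_sep is a hypothetical variable name; '|| true' just keeps the snippet
# going on systems where the locale utility is missing.
c_sep=$(LC_ALL=C locale thousands_sep 2>/dev/null || true)
printf 'C locale thousands_sep: "%s"\n' "$c_sep"
```

Running `locale thousands_sep` without LC_ALL=C shows what your default
locale uses instead; that value depends on the system's locale data.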



$ cat <<\EOF | sort -n --debug --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
EOF

sort: using ‘en_US.UTF-8’ sorting rules
58,1.49270399401
________________
________________
59,0.000192136419373
____________________
____________________
59,0.00182092924724
___________________
___________________
59,1.49270399401
________________
________________
60,0.00182092924724
___________________
___________________
60,1.49270399401
________________
________________
12,13.080339685
_______________
_______________
12,14.1531049905
________________
________________
12,26.7613447051
________________
________________
12,50.4592437035
________________
________________

In the en_US.UTF-8 locale, the comma is a thousands separator, so the
numeric parser skips over it and keeps going until the first character
that cannot be part of a number; 58,1.49270399401 is read as
581.49270399401 (yeah, you aren't really using comma as a thousands
separator, but such is life).
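
To see why the 12,... lines land last, it can help to mimic what the
parser effectively does with the commas. The sketch below is not how sort
is implemented; it simply deletes the commas with tr and sorts the result
numerically in the C locale, on the assumption that skipping a thousands
separator amounts to the same thing for this data. The variable name
as_read is made up for the example.

```shell
# Approximate how the en_US.UTF-8 numeric parser reads each line:
# commas are skipped as thousands separators, so drop them, then sort.
as_read=$(printf '%s\n' \
    58,1.49270399401 \
    60,1.49270399401 \
    12,13.080339685 \
    12,50.4592437035 |
  tr -d , | LC_ALL=C sort -n)
printf '%s\n' "$as_read"
```

The 12,... lines become numbers above 1200, which is larger than anything
the 58, 59, and 60 lines turn into, matching the order shown above.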


And finally, look what happens when you explicitly tell sort to stop
looking at the end of the first field with -k1,1, rather than relying on
the implied -k1, which starts at the first field and keeps reading until
it hits a non-numeric character:

$ cat <<\EOF | sort -n -k1,1 --debug --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
EOF

sort: using ‘en_US.UTF-8’ sorting rules
12,13.080339685
__
_______________
12,14.1531049905
__
________________
12,26.7613447051
__
________________
12,50.4592437035
__
________________
58,1.49270399401
__
________________
59,0.000192136419373
__
____________________
59,0.00182092924724
__
___________________
59,1.49270399401
__
________________
60,0.00182092924724
__
___________________
60,1.49270399401
__
________________
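
If you also want ties in the first field broken numerically by the second
field, one option is to give sort two explicitly bounded keys. A minimal
sketch, assuming the C locale so the result does not depend on locale
data; the 59,9.5 and 59,10.5 rows are made-up values added to show the
second key at work, and sorted is a hypothetical variable name.

```shell
# Two bounded numeric keys: field 1 only, then field 2 only.
# LC_ALL=C makes the result independent of the locale's numeric rules.
sorted=$(printf '%s\n' \
    59,10.5 \
    12,50.4592437035 \
    59,9.5 \
    58,1.49270399401 \
    12,13.080339685 |
  LC_ALL=C sort -t, -k1,1n -k2,2n)
printf '%s\n' "$sorted"
```

With only -k1,1n, the two 59 rows would fall back to the last-resort
byte comparison of the whole line, which puts 59,10.5 before 59,9.5.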




> 
> Can someone shed more light into this ? I'm also not sure if there is an 
> existing conversation about this,

Yes, it's a FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

and sort is doing what POSIX requires given your particular machine's
locale definitions, which in turn describe how collation and numeric
parsing behave in that locale.  Except for the C locale, different
vendors have tended to have different rules, even for locales that
otherwise share the same name.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
