Re: Does sort handle -t / correctly
From: Ray Dillinger
Subject: Re: Does sort handle -t / correctly
Date: Fri, 17 Apr 2015 12:53:12 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.6.0
I have compiled my own 'sort' which deliberately ignores
locale (more precisely, deliberately uses the 'C' locale
by default) for exactly this reason. I don't want to
screw with an environment variable that affects dozens
of things just to get sort to work predictably.
A while ago I offered a patch to put a -locale argument
in sort (so I could alias it to 'C' locale) but it was
rejected by the maintainers. So, screw it, I made my
own.
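For readers who would rather not rebuild sort, a rough approximation is to pin
the locale for the sort process alone. This is only a sketch: the function name
csort is made up for illustration, and it is not the patch or the custom binary
described above. The variable assignment applies to that one sort invocation
and leaves the rest of the shell's environment untouched.

```shell
# csort is a hypothetical wrapper name, not part of coreutils.
# LC_ALL=C is set for this single sort process only.
csort() {
    LC_ALL=C sort "$@"
}

# Byte-order sort regardless of the caller's locale.
printf '%s\n' ab aB 'a!' a | csort
```

With byte comparison this prints a, a!, aB, ab, one per line.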
Some political committee or other has banned simple,
efficient, predictable semantics, so it won't be in any
distribution any time soon. But if you value simple
efficient predictable semantics, I suggest you do the
same.
Bear
On 04/17/2015 09:26 AM, Eric Blake wrote:
> On 04/17/2015 10:10 AM, Peng Yu wrote:
>> Hi, I got the following results when I call sort with -t /. It seems
>> that 'a/1.txt' should be right after 'a'. Is it the case? Or I am not
>> using sort correctly?
>
> Your assumption is correct - you are using sort incorrectly, by failing
> to take locales into account, and by failing to limit the amount of data
> being compared to single field widths.
>
>>
>> $ printf '%s\n' a 'a!' ab aB a/1.txt | sort -t / -k 1 -k 2 -k 3 -k 4
>> a
>> a!
>> a/1.txt
>> aB
>> ab
>
> sort --debug is your friend:
>
> $ printf '%s\n' a 'a!' ab aB a/1.txt | sort --debug -t / -k 1 -k 2 -k 3 -k 4
> sort: using ‘en_US.UTF-8’ sorting rules
> a
> _
> ^ no match for key
> ^ no match for key
> ^ no match for key
> _
> a!
> __
> ^ no match for key
> ^ no match for key
> ^ no match for key
> __
> a/1.txt
> _______
> _____
> ^ no match for key
> ^ no match for key
> _______
> ab
> __
> ^ no match for key
> ^ no match for key
> ^ no match for key
> __
> aB
> __
> ^ no match for key
> ^ no match for key
> ^ no match for key
> __
>
>
> As shown in the debug trace, the line 'a!' sorts prior to the line
> 'a/1.txt' because your first sort key is the entire line, and in the
> locale you are using (where both '!' and '/', and also '.', are ignored
> in collation orders), the collation string "a" really does come before
> "a1txt".
>
> What you REALLY want is to limit your sorting to a single field at a
> time (-k1,1 rather than -k1), as in:
>
> $ printf '%s\n' a 'a!' ab aB a/1.txt | sort --debug -t / -k 1,1 -k 2,2
> sort: using ‘en_US.UTF-8’ sorting rules
> a
> _
> ^ no match for key
> _
> a/1.txt
> _
> _____
> _______
> a!
> __
> ^ no match for key
> __
> ab
> __
> ^ no match for key
> __
> aB
> __
> ^ no match for key
> __
>
>
> Or additionally, to limit your sorting to a locale that does not discard
> punctuation as unimportant, as in:
>
> $ printf '%s\n' a 'a!' ab aB a/1.txt | LC_ALL=C sort --debug -t / -k 1,1 -k 2
> sort: using simple byte comparison
> a
> _
> ^ no match for key
> _
> a/1.txt
> _
> _____
> _______
> a!
> __
> ^ no match for key
> __
> aB
> __
> ^ no match for key
> __
> ab
> __
> ^ no match for key
> __
>
>
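The difference between an open-ended key and a key limited to a single field
can be seen without --debug. The sketch below reuses the sample lines from the
thread and keeps the locale fixed at C, so only the key definition changes; the
helper name sample is an assumption introduced for the example.

```shell
# Sample lines from the thread; 'sample' is a hypothetical helper name.
sample() {
    printf '%s\n' a 'a!' ab aB a/1.txt
}

# Key 1 runs from field 1 to end of line, so '/' is compared as data.
sample | LC_ALL=C sort -t / -k 1

# Key 1 is field 1 only; ties are broken by field 2 onward.
sample | LC_ALL=C sort -t / -k 1,1 -k 2
```

The first command prints a, a!, a/1.txt, aB, ab. The second prints
a, a/1.txt, a!, aB, ab, which places 'a/1.txt' directly after 'a' as the
original question expected.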