bug-coreutils

bug#9418: closed (Re: bug#9418: case sensitivity buggy in sort)


From: Davide Brini
Subject: bug#9418: closed (Re: bug#9418: case sensitivity buggy in sort)
Date: Fri, 2 Sep 2011 11:10:26 +0200

On Fri, 2 Sep 2011 08:46:23 +0200, Michał Janke <address@hidden> wrote:

> I definitely don't agree with "locale issue" explanation. This is not
> a problem of some letter being treated as > or < than other
> - the problem is that it is _sometimes_ one way, sometimes the other!
> Please have a closer look at this one:
> 
> $ cat aaa
> aa 1
> AA 1
> Aa 0
> 
> Now consider what should be the output of sort in two cases: A>a and A<a.
> If A>a, the result should be
> aa 1
> Aa 0
> AA 1
> 
> If A<a, it should be
> AA 1
> Aa 0
> aa 1
> 
> And now the actual result:
> 
> $ sort aaa
> Aa 0
> aa 1
> AA 1
> 
> So the lines are sorted in first place according to the second column!

I think what you're seeing is that Unicode collation is multilevel. In a
nutshell (and very simplified), "A" and "a" are, for Unicode, "the same base
letter", so they compare as equal at the first level. The first difference
that counts is then "0" versus "1" in the second column, and that is what
determines the sort order in your example. Between themselves, however, "a"
sorts before "A", which explains lines 2 and 3 of your output. Again, this is
a gross oversimplification, but hopefully it gives you the idea.

A bit less simplified (but still quite far from the real thing):

- If there are any differences in base letters, that determines the result
- Otherwise, if there are any differences in accents*, that determines the
  result
- Otherwise, if there are any differences in case*, that determines the
  result
- Otherwise, if there are any differences in punctuation*, that determines
  the result

(taken from one of the pages linked below)
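
To make the levels above concrete, here is a minimal Python sketch of a
multilevel sort key. It is only an illustration of the idea, not what sort,
glibc or ICU actually do: the function name multilevel_key is made up for
this example, accents are approximated by Unicode combining marks, the
punctuation level is left out, and digits and spaces simply compare by code
point.

```python
import unicodedata

def multilevel_key(line):
    """Toy three-level key: base letters, then accents, then case.

    Hypothetical helper for illustration only; the real Unicode
    Collation Algorithm uses weight tables, not these shortcuts.
    """
    # Decompose so that accented letters become base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", line)
    base = [c for c in decomposed if not unicodedata.combining(c)]
    accents = [c for c in decomposed if unicodedata.combining(c)]

    primary = [c.lower() for c in base]      # level 1: base letters, case ignored
    secondary = accents                      # level 2: accents (crude approximation)
    tertiary = [c.isupper() for c in base]   # level 3: lowercase before uppercase
    return (primary, secondary, tertiary)

lines = ["aa 1", "AA 1", "Aa 0"]

# Multilevel comparison: "0" vs "1" decides first, case only breaks the tie.
print(sorted(lines, key=multilevel_key))   # ['Aa 0', 'aa 1', 'AA 1']

# Plain code point comparison, for contrast: every "A" sorts before every "a".
print(sorted(lines))                       # ['AA 1', 'Aa 0', 'aa 1']
```

With the toy key, the three lines come out in the same order as in your sort
output, while the plain code point order matches your "A<a" expectation.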

You may want to read 

http://www.unicode.org/reports/tr10/

and play with (for example)

http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col

(choosing the locale of your choice) to get an idea of how it works.

With your example and the US locale, the tool gives

03: Aa 0
27 27 04 12 01 08 01 8f 07 00
01: aa 1
27 27 04 14 01 08 01 08 00
02: AA 1
27 27 04 14 01 08 01 8f 8f 06 00 

which are the sort keys for each line, listed in collation order. You can
interpret them with the help of the document that explains the Unicode
Collation Algorithm.
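
As a rough aid for reading those dumps, the following Python sketch splits
each key into its levels. It assumes that byte 01 separates the levels and
that the trailing 00 is a terminator, which is how I read the ICU output
above; split_levels and SORT_KEYS are names invented for this example.

```python
# Sort keys copied from the demo output above.
SORT_KEYS = {
    "Aa 0": "27 27 04 12 01 08 01 8f 07 00",
    "aa 1": "27 27 04 14 01 08 01 08 00",
    "AA 1": "27 27 04 14 01 08 01 8f 8f 06 00",
}

def split_levels(hex_dump):
    """Split a hex-dumped sort key into per-level hex strings.

    Assumption: 0x01 separates levels and 0x00 terminates the key.
    """
    raw = bytes.fromhex(hex_dump)     # whitespace between byte pairs is ignored
    body = raw.rstrip(b"\x00")        # drop the terminator
    return [level.hex(" ") for level in body.split(b"\x01")]

for line, dump in SORT_KEYS.items():
    print(line, split_levels(dump))

# Comparing the raw keys byte by byte reproduces the collation order.
ordered = sorted(SORT_KEYS, key=lambda line: bytes.fromhex(SORT_KEYS[line]))
print(ordered)   # ['Aa 0', 'aa 1', 'AA 1']
```

Read this way, the first level of "Aa 0" (27 27 04 12) already differs from
that of the other two lines (27 27 04 14), so it sorts first. "aa 1" and
"AA 1" agree on the first two levels and differ only on the third (08 versus
8f 8f 06), which is where the case difference shows up.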

(not that I agree with any of the localization madness, but
understanding always helps).

-- 
D.
