bug-coreutils

bug#11621: questionable locale sorting order (especially as related to c


From: Linda A. Walsh
Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Date: Wed, 06 Jun 2012 18:16:02 -0700
User-agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.24) Gecko/20100228 Lightning/0.9 Thunderbird/2.0.0.24 Mnenhy/0.7.6.666

Pádraig Brady wrote:
On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
Pádraig Brady wrote:
On 06/03/2012 11:13 PM, Linda Walsh wrote:
Within the past few years, use of ranges in REs has become
unreliable due to some locale changes sorting their native character
sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
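
A minimal sketch of the effect on character ranges, assuming a POSIX shell with grep available. Only the C locale result is predictable everywhere; what a locale such as en_US.UTF-8 matches depends on the installed locale data and the grep version, so no output is claimed for it. The helper name match_lower_range is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the input lines that match [a-z]
# when grep runs under the locale named in $1.
match_lower_range() {
    printf '%s\n' a A b B y Y z Z | LC_ALL="$1" grep '^[a-z]$' | tr '\n' ' '
}

# In the C locale the range follows byte values, so only lowercase letters match.
match_lower_range C; echo

# To compare with another locale on your own system, for example:
#   match_lower_range en_US.UTF-8; echo
```

Running the same helper with a different locale name is a quick way to see whether a given system treats the range by byte value or by the locale's collation sequence.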

There seems to be a problem in that, when a user has set their system to use
Unicode, it no longer uses the locale-specific character set (iso-8859-x,
or others).
----
    To clarify my above statement:


   There seems to be a problem in that, when a user has set their system to use
Unicode, it no longer uses the locale-specific character set (iso-8859-x,
or others) -- ***or*** *their* *orderings*.  I.e., Unicode defines a collation
order -- I don't know that the others do ('C' does, but I don't know about
other locale-specific character sets).


It's not specific to "unicode". Sorting in an iso-8859-1 charset
results in locale ordering:
----
    Can you cite a source specifying the sort/collation order of the
iso-8859-1 charset, one that would show this result does not conform to the
collation specification for that charset?

    I.e., if there is no official source, then the order with that charset
is "undefined", and while it may not be desirable, returning a<A<b<B would not be
"an error".

It's a charset. Of course the order is defined. Try: man iso-8859-1

The relative ordering can be trivially inferred from the command I presented.
But to be explicit:

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US [sic] sort | iconv -f iso-8859-1
a
A
á
b
----
Your example doesn't show the collation order of iso-8859-1. You are setting it to 'en_US': LC_ALL overrides all other LC variables, whereas LANG only sets the default, which individual LC variables can override.
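
The precedence just described can be sketched as a small shell function. This is an illustration of the POSIX lookup order for the collation category (LC_ALL, then LC_COLLATE, then LANG), not code taken from sort or libc. The function name effective_collate is hypothetical, and the final fallback to "C" is an assumption about the implementation default.

```shell
#!/bin/sh
# Hypothetical helper: given the values of LC_ALL, LC_COLLATE and LANG
# (an empty string means "unset"), print the locale name that would
# govern collation under the POSIX precedence rules.
effective_collate() {
    if [ -n "$1" ]; then
        echo "$1"      # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        echo "$2"      # LC_COLLATE applies when LC_ALL is unset
    elif [ -n "$3" ]; then
        echo "$3"      # LANG is only the default
    else
        echo "C"       # assumption: implementation default behaves like C/POSIX
    fi
}

# Report the effective collation locale for the current environment.
effective_collate "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"
```

For instance, effective_collate en_US C de_DE prints en_US, because a non-empty LC_ALL wins over both other variables.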

A corrected example:

$ (Charset=iso-8859-1; printf "%s\n" A b B a á | iconv -t $Charset | LANG=en_US LC_CHARSET=$Charset LC_COLLATE=$Charset sort | iconv -f $Charset |tr "\n" " ";echo "")
A B a b á

(I used 'Charset' to hold the charset name, added parens, printed them in the same orientation as input, and added a 2nd capital letter to make upper/lower case ordering clear.)

I might note how "trivial" it was to arrive at incorrect output. People often think me a pain because I ask them to explain what they perceive to be obvious. Unfortunately, what is obvious to one person may not be so to another.

The 'á' is not ASCII (the original charset for the C locale, which comes from Unix and the C programming language -- a reason why POSIX renamed the 'C' locale to the POSIX locale).

However, as 'á' is in the first 256 chars (above the ASCII range), it can still work if you remove the iconv stuff (and note, I have no other locale vars set):

$ echo ${!LC_*} ${!LAN*}
LC_COLLATE LC_CTYPE

$ (Charset=ASCII; printf "%s\n" A B b a á | LC_CHARSET=$Charset LC_COLLATE=$Charset sort |tr "\n" " ";echo "")
A B a b á

   To bring this to completion -- most Linux systems today use the UTF-8
character set. It shows an *identical* collation order for the above chars as the iso-8859-1 charset.
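
One property that is consistent with this observation, offered as a sketch rather than a diagnosis: UTF-8 preserves code point order at the byte level, so a plain byte comparison of UTF-8 text ranks these five characters the same way their iso-8859-1 byte values do. The example below forces byte comparison with LC_ALL=C; the helper name sort_bytes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: sort the arguments by raw byte value (C locale)
# and print them on one line, separated by spaces.
sort_bytes() {
    printf '%s\n' "$@" | LC_ALL=C sort | tr '\n' ' '
}

# 'á' (U+00E1) is written with octal escapes so the script does not
# depend on its own file encoding: \303\241 is its UTF-8 encoding.
a_utf8=$(printf '\303\241')

# Uppercase letters sort first, then lowercase, then the accented letter,
# because the lead byte 0xC3 is larger than any ASCII byte.
sort_bytes A B b a "$a_utf8"; echo
```

On a terminal that displays UTF-8, this prints the letters in the order A B a b á.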

It appears that the collating functions are confused by the notation that has been adopted in many distributions, namely <locale>.charset. This matters in such a notation when the charset has been explicitly specified and has explicit collation and case-folding rules (those for Unicode are extensive and handle accents as well as other forms like ſȘșʂȿᵴᶊṠṡṢṣṤṥṦṧṨṩẛẜẝẞⱾꞨꞩSsߌśŜŝŞşŠšˢ...etc.).

Therefore, I would like to see the character set's collation and folding rules used where they are officially specified (as in the case of Unicode or POSIX).

   Are you the person responsible for the libicuXXX files?




