bug#11621: questionable locale sorting order (especially as related to char ranges in REs)

bug-coreutils

From:	Linda A. Walsh
Subject:	bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Date:	Wed, 06 Jun 2012 18:16:02 -0700
User-agent:	Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.8.1.24) Gecko/20100228 Lightning/0.9 Thunderbird/2.0.0.24 Mnenhy/0.7.6.666

Pádraig Brady wrote:

On 06/04/2012 06:03 AM, Linda A. Walsh wrote:

Pádraig Brady wrote:

On 06/03/2012 11:13 PM, Linda Walsh wrote:

Within the past few years, use of ranges in RE's has become
unreliable due to some locale changes sorting their native character
sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
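A minimal sketch of the range issue described above, assuming a POSIX shell with grep and paste available. The sample letters and variable names are illustrative only. Under LC_ALL=C the range [a-z] covers just the lowercase ASCII letters; what the same range matches under another locale depends on that locale's collation data, so no output is claimed for that case.

```shell
# Illustrative input: two lowercase and two uppercase letters, one per line.
letters=$(printf '%s\n' a B y Z)

# In the C locale, [a-z] is the byte range 0x61-0x7A, so only 'a' and 'y' match.
c_matches=$(printf '%s\n' "$letters" | LC_ALL=C grep '^[a-z]$' | paste -sd ' ' -)
echo "C locale: $c_matches"

# Under the caller's own locale the result may differ (it can include
# uppercase letters when the locale collates as a<A<b<B...).
cur_matches=$(printf '%s\n' "$letters" | grep '^[a-z]$' | paste -sd ' ' -)
echo "current locale: $cur_matches"
```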

There seems to be a problem when a user has set their system to use
Unicode, it is no longer using the locale specific character set (iso-8859-x,
or others).

----
    To clarify my above statement:


   There seems to be a problem when a user has set their system to use
Unicode: It is no longer using the locale specific character set (iso-8859-x,
or others) -- ***or*** *their* *orderings*.  I.e. Unicode defines a collation
order -- I don't know that the others do ('C' does, but I don't know about
other locale-specific character sets).

It's not specific to "unicode". Sorting in an iso-8859-1 charset
results in locale ordering:

----
    Can you cite a source specifying the sort/collation order of the
iso-8859-1 charset that would prove that it is non-conforming to the collation
specification for that charset?

    I.e. If there is no official source, then the order with that charset
is "undefined", and while it may not be desirable, returning a<A<b<B, would not be 
"an error".


It's a charset. Of course the order is defined. Try: man iso-8859-1

The relative ordering can be trivially inferred from the command I presented.
But to be explicit:

$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US [sic] sort | iconv -f iso-8859-1
a
A
á
b

----

Your example doesn't show the collation order of iso-8859-1. You are setting it to 'en_US' (as LC_ALL overrides all other LC vars; LANG sets the default, but individual settings in the LC variables can override it).
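The precedence just described can be sketched as follows, assuming a POSIX shell with sort and paste. The locale name en_US.UTF-8 is only an illustration and does not need to be installed, because in both commands a higher-priority variable selects the C locale.

```shell
# LC_ALL wins over LC_COLLATE: sort uses C ordering even though
# LC_COLLATE names a different (illustrative) locale.
overridden=$(printf '%s\n' A b B a | LC_COLLATE=en_US.UTF-8 LC_ALL=C sort | paste -sd ' ' -)
echo "$overridden"

# With LC_ALL unset, LC_COLLATE wins over the LANG default.
from_category=$( (unset LC_ALL; printf '%s\n' A b B a | LANG=en_US.UTF-8 LC_COLLATE=C sort) | paste -sd ' ' -)
echo "$from_category"
```

Both commands should print the C ordering, A B a b, since the variable with the higher precedence is set to C in each case.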


A corrected example:

$ (Charset=iso-8859-1; printf "%s\n" A b B a á | iconv -t $Charset | LANG=en_US LC_CHARSET=$Charset LC_COLLATE=$Charset sort | iconv -f $Charset | tr "\n" " "; echo "")
A B a b á

(I used 'Charset' to hold the charset name, added parens, printed them in the same orientation as input, and added a 2nd capital letter to make upper/lower case ordering clear.)

I might note how "trivial" it was to arrive at incorrect output. People often think me a pain because I ask them to explain what they perceive to be obvious. Unfortunately, what is obvious to one person may not be so to another.

The 'á' is not ASCII (the original charset for the C locale, coming from Unix & the C programming language -- a reason why POSIX renamed the 'C' locale to the POSIX locale).

However, as 'á' is in the 1st 256 chars (above the ASCII range), it can still work if you remove the iconv stuff (and note, I have no other locale vars set):

$ echo ${!LC_*} ${!LAN*}
LC_COLLATE LC_CTYPE

$ (Charset=ASCII; printf "%s\n" A B b a á | LC_CHARSET=$Charset LC_COLLATE=$Charset sort | tr "\n" " "; echo "")
A B a b á
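For comparison, a small sketch assuming a POSIX shell with sort and paste: with LC_ALL=C, sort compares raw bytes, and the bytes that encode 'á' (0xE1 in iso-8859-1, 0xC3 0xA1 in UTF-8) are above the ASCII range, so 'á' lands after all the ASCII letters. The variable name is illustrative.

```shell
# 'á' is written as the octal escapes of its UTF-8 bytes (0xC3 0xA1)
# so that the script itself stays pure ASCII.
sorted=$(printf 'A\nB\nb\na\n\303\241\n' | LC_ALL=C sort | LC_ALL=C paste -sd ' ' -)
echo "$sorted"
```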


   To bring this to completion -- most Linux systems today use the UTF-8 character set. It shows an *identical* collation order for the above chars as the iso-8859-1 charset.

It appears that the collating functions are confused by the notation that has been adopted in many distributions...namely <locale>.charset. In such a notation, where the charset has been explicitly specified, and where the charset has explicit COLLATION and case folding rules (those for Unicode are extensive and handle accents as well as other forms like ſȘșʂȿᵴᶊṠṡṢṣṤṥṦṧṨṩẛẜẝẞⱾꞨꞩSsßŚśŜŝŞşŠšˢ...etc.).

Therefore, I would like to see the character set's collation and folding rules used where they are officially specified (as in the case of Unicode or POSIX).


   Are you the person responsible for the libicuXXX files?
