Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?

From: Paolo Bonzini
Subject: Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?
Date: Thu, 27 Jun 2013 10:48:45 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130514 Thunderbird/17.0.6

Il 27/06/2013 09:33, Aharon Robbins ha scritto:
> Hi Paolo.
>> > I still believe that there is no place other than the glibc locale
>> > descriptions where this can be fixed.
> This is necessary but not sufficient. All of gawk, grep, sed and bash
> run on lots of non-GLIBC systems.

On non-glibc systems they use gnulib's regex implementation, so they're

> The locale definitions, even for
> the same locale, vary wildly out in the wild.  Therefore there's no
> other practical choice but to fix each program to provide Rational
> Range Interpretation.
> Fortunately, gawk and grep are already there, and I think the sed in
> the git repo is as well.  Once Bash turns this on as default, the
> world will definitely be a better place, independent of GLIBC.

I have already explained multiple times why this is completely delusional.

1) grep, sed, coreutils and so on will only use representation-based
range interpretation (I prefer this more neutral term that also explains
what's going on) if you use gnulib's regex implementation.  And by
default, they use glibc (I just checked grep).

2) Even if you switched the default, you would be at the mercy of
distros.  Distros prefer to avoid glibc replacements in single packages,
because then all bugs have to be fixed in many different places.  In
fact, I checked grep and Fedora builds it with --without-included-regex.
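
To make the two interpretations concrete, here is a minimal sketch in
Python. The collation table is a hypothetical stand-in modelled on the
order in the subject line (a<A<b<B<...<z<Z); it is not read from any
real locale definition, and the function names are made up for the
illustration.

```python
import string

# Hypothetical collation order modelled on the subject line:
# a < A < b < B < ... < z < Z.  Not taken from a real locale file.
COLLATION = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def collation_range(first, last):
    """Characters matched by [first-last] when the range follows collation order."""
    start, end = COLLATION.index(first), COLLATION.index(last)
    return set(COLLATION[start:end + 1])


def representation_range(first, last):
    """Characters matched by [first-last] when the range follows code point values."""
    return {chr(code) for code in range(ord(first), ord(last) + 1)}


by_collation = collation_range("a", "z")
by_representation = representation_range("a", "z")

# Under the collation order, [a-z] picks up A through Y but not Z.
print(sorted(by_collation - by_representation))
```

With this table, [a-z] under collation order matches every lowercase
letter plus A through Y, while the representation-based reading matches
the 26 lowercase letters only.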

Not to mention how this is entirely Latin-centric.  There are some
encodings in which there is absolutely no relation between the encoding
and the expected collation order.  For example, the pre-Unicode Cyrillic
encoding KOI8 uses a phonetic layout in which, for example, the Cyrillic Г
is placed 128 positions above the Latin G (with the letter case swapped).
This was done so that the text remained human-readable when the high bit
is stripped (thus "forcefully" converting it to ASCII), and so that one
could read the text even without having Cyrillic fonts.
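
The KOI8 point can be checked with a short Python sketch. It assumes the
KOI8-R variant (Python's "koi8_r" codec); the helper name and the sample
strings are made up for the illustration.

```python
def strip_high_bit(text):
    """Encode text as KOI8-R, clear bit 7 of every byte, and read the result as ASCII."""
    return bytes(b & 0x7F for b in text.encode("koi8_r")).decode("ascii")


# Lowercase Cyrillic г sits exactly 128 positions above Latin G.
offset = "г".encode("koi8_r")[0] - ord("G")

# Stripping the high bit leaves a readable Latin transliteration
# (letter case comes out swapped).
readable = strip_high_bit("Привет")

# Sorting by byte value does not give alphabetical order.
alphabet_start = list("абвгд")
by_byte_value = sorted(alphabet_start, key=lambda ch: ch.encode("koi8_r"))

# A byte-value range from а to г misses в and picks up ц.
low, high = "а".encode("koi8_r")[0], "г".encode("koi8_r")[0]
in_range = {ch for ch in "абвгдц" if low <= ch.encode("koi8_r")[0] <= high}

print(offset, readable, by_byte_value, sorted(in_range))
```

In this encoding a representation-based range such as [а-г] leaves out в
and includes ц, which is the mismatch between byte order and collation
order described above.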

Ergo, the only way to get your desired range interpretation is to fix
not even glibc regex, but rather the glibc locale definitions.

