Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?

From: Paolo Bonzini
Subject: Re: locale specific ordering in EN_US -- why is a<A<b<B<y<Y<z<Z?
Date: Fri, 28 Jun 2013 11:08:32 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130514 Thunderbird/17.0.6

On 27/06/2013 21:13, Chet Ramey wrote:
> On 6/27/13 4:48 AM, Paolo Bonzini wrote:
>> On 27/06/2013 09:33, Aharon Robbins wrote:
>>> Hi Paolo.
>>>>> I still believe that there is no place other than the glibc locale
>>>>> descriptions where this can be fixed.
>>> This is necessary but not sufficient. All of gawk, grep, sed and bash
>>> run on lots of non-GLIBC systems.
>> On non-glibc systems they use gnulib's regex implementation, so they're
>> fine.
> You presume much.  Bash, for instance, doesn't use a regex implementation,
> especially not gnulib's.

Does that mean you have no way to make sure that [A-Z] is sane on
non-glibc implementations?  I see that sh_regmatch uses regcomp/regexec
in bash 4.2.

>>> The locale definitions, even for
>>> the same locale, vary wildly out in the wild.  Therefore there's no
>>> other practical choice but to fix each program to provide Rational
>>> Range Interpretation.
>>> Fortunately, gawk and grep are already there, and I think the sed in
>>> the git repo is as well.  Once Bash turns this on as default, the
>>> world will definitely be a better place, independent of GLIBC.
>> I already explained multiple times how this is completely delusional.
> A little bit strong, no?  If you use your own matching code, it's a small
> matter to change strcoll to strcmp.

What about the remaining 99% of programs that don't use their own
matching code?  Including the bash 4.2 source I'm looking at?
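
To make the strcoll-versus-strcmp point concrete, here is a minimal
sketch in Python rather than C.  It assumes Python's locale.strcoll as a
stand-in for the C library call; in_range and cmp_codepoint are
hypothetical helper names, not anything taken from bash or glibc.  The
collation result depends entirely on the locale data installed on the
machine, so no expected output is shown for it.

```python
import locale


def cmp_codepoint(a, b):
    """strcmp-like comparison: order strings by code point value."""
    return (a > b) - (a < b)


def in_range(ch, lo, hi, cmp):
    """Return True if ch falls inside the bracket range [lo-hi] under cmp."""
    return cmp(lo, ch) <= 0 and cmp(ch, hi) <= 0


if __name__ == "__main__":
    try:
        # Use whatever collation rules the environment asks for.
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        pass  # requested locale not installed; stay in the "C" locale

    for ch in "bBzZ":
        print(ch,
              "representation:", in_range(ch, "A", "Z", cmp_codepoint),
              "collation:", in_range(ch, "A", "Z", locale.strcoll))
```

With cmp_codepoint, [A-Z] never contains a lowercase letter.  With
locale.strcoll the answer is whatever the locale description says, which
is the behaviour under discussion in this thread.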

>> 1) grep, sed, coreutils and so on will only use representation-based
>> range interpretation (I prefer this more neutral term that also explains
>> what's going on) if you use gnulib's regex implementation.  And by
>> default, they use glibc (I just checked grep).
>> 2) Even if you switched the default, you would be at the mercy of
>> distros.  Distros prefer to avoid glibc replacements in single packages,
>> because then all bugs have to be fixed in many different places.  In
>> fact, I checked grep and Fedora builds it with --without-included-regex.
> There are systems of interest besides Linux and its distros.

But for those the only choice is to use your own matcher if you want to
impose the semantics you desire.  These two points are only about glibc.

>> Not to mention how this is entirely Latin-centric.  There are some
>> encodings in which there is absolutely no relation between the encoding
>> and the expected collation order.
> And there's no portable way to obtain this information in any case, glibc
> or not.

Which is why the right thing is to fix the locale descriptions in glibc,
and keep using glibc.  glibc is the only place where you have the
information.  This way you fix Latin and don't break anyone else.

But actually, representation-based range interpretation is broken even
for Latin, because À should collate between A and B.  It's just the
lowercase/uppercase order that is insane, and that's the *one* thing
that should be fixed.  In glibc.
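
A small sketch of the À problem, again in Python and again only an
illustration: LETTERS, by_code_point and by_collation are hypothetical
names, and locale.strxfrm stands in for the C library's strxfrm.  Sorting
by code point puts À (U+00C0) after every unaccented Latin letter, so a
representation-based [A-Z] cannot include it.  Where À lands under
collation depends on the installed locale data.

```python
import locale

LETTERS = ["B", "À", "a", "A", "b"]


def by_code_point(items):
    """Sort by raw code point value (representation-based order)."""
    return sorted(items)


def by_collation(items):
    """Sort using the current LC_COLLATE rules via strxfrm."""
    return sorted(items, key=locale.strxfrm)


if __name__ == "__main__":
    print("code point:", by_code_point(LETTERS))
    try:
        locale.setlocale(locale.LC_COLLATE, "")
        print("collation: ", by_collation(LETTERS))
    except locale.Error:
        print("collation:  environment locale not available")
```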

> So if this is to be `fixed' only either by changing every locale
> definition everywhere or changing the matching code, I vote for changing
> the matching code.  We just have to agree on an interpretation and make
> sure the various matchers agree.

Again: changing the matching code doesn't help on GNU systems, where the
only matching code that is (or should be) used by GNU programs is GNU

