bug-grep

Re: character ranges in regular expressions


From: Eric Blake
Subject: Re: character ranges in regular expressions
Date: Fri, 24 Sep 2010 16:27:53 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.9) Gecko/20100907 Fedora/3.1.3-1.fc13 Mnenhy/0.8.3 Thunderbird/3.1.3

On 09/24/2010 03:52 PM, Bruno Haible wrote:

1) Is there an agreement of what the result should be? Jim seems to prefer to
extrapolate the result of the "C" locale, i.e. 26.

As do I.

For other people, the locale
dependent behaviour is useful, that is, 51 is desired.

Which is why my proposal is that glibc consider:

[A-Z] => match C locale; 26 letters, regardless of locale
[[.A.]-[.Z.]] => use collation rules, since the range is spelled explicitly with collating symbols (26 letters in the POSIX locale; 51 or even more in other locales, since accented characters might fall within the collation range), so that we aren't completely losing CEO behavior (if someone seriously has a reason to use it)
[[:upper:]] => per POSIX rules in all locales

as well as:

clean up all the locale tables to make CEO consistent with strcoll, rather than having some bizarre locales like cs_CZ. The locale definition file determines both strcoll and CEO ordering, but you can rearrange lines within a locale definition in a way that leaves strcoll unchanged while still changing CEO. So the bug in screwy locales like cs_CZ is that they didn't follow a common layout pattern in the locale definition file.
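
To make the 26-versus-51 numbers concrete, here is a small sketch of the two interpretations of [A-Z]. It is only a toy model, not glibc's regex code: the helper names are made up for illustration, and the interleaved sequence "aAbB...zZ" is an assumed stand-in for a locale whose collation order alternates lower and upper case.

```python
import string


def range_by_codepoint(lo, hi, alphabet):
    """Native ordering: a character is in lo-hi if its code point is."""
    return [c for c in alphabet if ord(lo) <= ord(c) <= ord(hi)]


def range_by_collation(lo, hi, order):
    """Collation ordering: a character is in lo-hi if its position in the
    given collation sequence lies between the positions of lo and hi."""
    pos = {c: i for i, c in enumerate(order)}
    return [c for c in order if pos[lo] <= pos[c] <= pos[hi]]


# Assumed (hypothetical) collation sequence: a A b B c C ... z Z
INTERLEAVED = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)

native = range_by_codepoint("A", "Z", string.ascii_letters)
collated = range_by_collation("A", "Z", INTERLEAVED)

print(len(native), len(collated))  # prints "26 51"
```

Under the assumed sequence, the collation-based range picks up every ASCII letter except lowercase "a", which is where 51 comes from; a real locale with accented letters inside the range would give more.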

From around 2000, I
remember a mail from Ulrich Drepper where he essentially said "you have to
learn that in other locales range expressions work differently, use [[:alpha:]]
instead".

But 2000 was in the timeframe when the POSIX rules on CEO were still current; that POSIX rule was relaxed in 2001, such that POSIX itself admits that CEO has a number of shortcomings, and mentions that native ordering (i.e., matching the C locale) is a valid implementation option.
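
As a rough illustration of what "matching the C locale" means for a range, the sketch below defines range membership purely in terms of strcoll and evaluates it with LC_COLLATE set to "C". The helper in_range_strcoll is a hypothetical name for this example; it models the idea of a strcoll-consistent range and is not how any regex implementation actually evaluates bracket expressions.

```python
import locale
import string

# Use the C locale for collation; it is always available.
locale.setlocale(locale.LC_COLLATE, "C")


def in_range_strcoll(c, lo, hi):
    """True if lo <= c <= hi according to strcoll in the current locale."""
    return locale.strcoll(lo, c) <= 0 and locale.strcoll(c, hi) <= 0


# Which ASCII letters fall inside A-Z when strcoll decides the order?
matches = [c for c in string.ascii_letters if in_range_strcoll(c, "A", "Z")]

print(len(matches))
```

In the C locale, strcoll on ASCII letters agrees with code point order, so the range contains only the uppercase letters. Running the same check under another LC_COLLATE setting may give a different set, depending on which locales are installed.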


2) Is Ulrich aware that the subtle differences in the localedata/locales/*
files lead to bizarre behaviour of regexec() in the cs_CZ, pl_PL, etc. locales?

If he still actively reads glibc bugs, yes:

http://sourceware.org/bugzilla/show_bug.cgi?id=12045
http://sourceware.org/bugzilla/show_bug.cgi?id=12051

--
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


