bug#33837: Unexpected result for regex with non-ascii range


From: Reinis Danne
Subject: bug#33837: Unexpected result for regex with non-ascii range
Date: Sun, 23 Dec 2018 23:06:40 +0200

On Sun, 23 Dec 2018 at 22:18, Jim Meyering (<address@hidden>) wrote:
>
> tags 33873 notabug
> close 33873
> stop
>
> On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne <address@hidden> wrote:
> > grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> > of yY for lv_LV.UTF-8 locale (by implementing rational range
> > interpretation?) [1].
> >
> > [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
> >
> > However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected
> > results:
> > $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> > aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> > Ž
> > $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> > a
> > āĀb
> > c
> > čČd
> ...
> >
> > For the uppercase the result is completely bogus, but for the lowercase
> > range it seems that accented uppercase letters are interleaved with the
> > lowercase ones.
> >
> > I would expect all letters to have their uppercase variants
> > de-interleaved here.
> >
> > I don't know if grep alters the collation rules or it is done by glibc
> > (2.28).
> > strxfrm() gives me this result:
> > Using LC_COLLATE=lv_LV.UTF-8
> > char    strxfrm
> > i    c2b7010201020101e29b96
> > I    c2b7010201070101e2afb7
> ...
>
> Thanks for the report. However, ...
> Using a multi-byte character as a range endpoint elicits what the
> standards documents call "unspecified behavior".
>
> Quoting grep's own manual,
>
> > Within a bracket expression, a "range expression" consists of two 
> > characters separated by a hyphen.  It matches any single character that 
> > sorts between the two characters, inclusive.  In the default C locale, the 
> > sorting sequence is the native character order; for example, '[a-d]' is 
> > equivalent to '[abcd]'.  In other locales, the sorting sequence is not 
> > specified, and '[a-d]' might be equivalent to '[abcd]' or to '[aBbCcDd]', 
> > or it might fail to match any character, or the set of characters that it 
> > matches might even be erratic.  To obtain the traditional interpretation of 
> > bracket expressions, you can use the 'C' locale by setting the 'LC_ALL' 
> > environment variable to the value 'C'.
>
> For the record, POSIX says this:
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html:
>
> > Range expressions are, historically, an integral part of REs. However, the 
> > requirements of "natural language behavior" and portability do conflict. In 
> > the POSIX locale, ranges must be treated according to the collating 
> > sequence and include such characters that fall within the range based on 
> > that collating sequence, regardless of character values. In other locales, 
> > ranges have unspecified behavior.
>
> I am marking the auto-created issue as "not-a-bug", and can't even
> (reasonably) label it as "wishlist", because allowing what your usage
> implies is fundamentally contradictory.
>
> You're welcome to continue the discussion here.

Thank you for the response.

I had read that document before, but I didn't realize that sorting order
and collation order are two different things. Or rather, I assumed that
alphabetic sorting implied collation, whereas the sorting order the
manual talks about refers to a comparison of numeric code point values.
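
The outputs quoted above are consistent with the ranges being evaluated by
code point value rather than by Latvian collation. Below is a minimal sketch
in Python, under the assumption (not confirmed in this thread) that grep 3.3
compares range endpoints by code point in a UTF-8 locale. Python's `re`
module compares code points for ranges in `str` patterns, so it can stand in
for that interpretation. The names `ALPHABET`, `lower_runs`, and `upper_runs`
are illustrative only.

```python
import re
import unicodedata

# Test string from the report, normalized to NFC so that each accented
# letter is a single precomposed code point.
ALPHABET = unicodedata.normalize(
    "NFC",
    "aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ",
)

# In a str pattern, Python's re treats [x-y] as a code point range.
# "+" is used instead of "*" to skip empty matches, much as grep -o
# prints only non-empty matches.
lower_runs = re.findall("[a-\u017e]+", ALPHABET)  # 'a' (U+0061) .. 'ž' (U+017E)
upper_runs = re.findall("[A-\u017d]+", ALPHABET)  # 'A' (U+0041) .. 'Ž' (U+017D)

print(lower_runs[:4])
print(upper_runs)
print(hex(ord("\u017d")), hex(ord("\u017e")))  # Ž sorts one code point before ž
```

Under that assumption the first four lowercase runs are a, āĀb, c and čČd,
matching the report: ASCII uppercase letters fall below U+0061 and break the
run, while accented uppercase letters such as Ā (U+0100) lie inside the
range. The uppercase range leaves out only ž (U+017E), which is one code
point above the endpoint Ž (U+017D), so the match splits just before the
final Ž.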




