bug#33837: Unexpected result for regex with non-ascii range
From: Jim Meyering
Subject: bug#33837: Unexpected result for regex with non-ascii range
Date: Sun, 23 Dec 2018 12:17:52 -0800
tags 33873 notabug
close 33873
stop
On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne <address@hidden> wrote:
> grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> of yY for lv_LV.UTF-8 locale (by implementing rational range
> interpretation?) [1].
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
>
> However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected
> results:
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> Ž
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> a
> āĀb
> c
> čČd
...
>
> For the uppercase the result is completely bogus, but for the lowercase range
> it seems that accented uppercase letters are interleaved with the
> lowercase ones.
>
> I would expect all letters to have their uppercase variants de-interleaved
> here.
>
> I don't know if grep alters the collation rules or it is done by glibc (2.28).
> strxfrm() gives me this result:
> Using LC_COLLATE=lv_LV.UTF-8
> char strxfrm
> i c2b7010201020101e29b96
> I c2b7010201070101e2afb7
...
Thanks for the report. However, using a range expression like these,
with multi-byte endpoints in a locale other than C/POSIX, elicits what
the standards documents call "unspecified behavior".
Quoting grep's own manual,
> Within a bracket expression, a "range expression" consists of two characters
> separated by a hyphen. It matches any single character that sorts between
> the two characters, inclusive. In the default C locale, the sorting sequence
> is the native character order; for example, '[a-d]' is equivalent to
> '[abcd]'. In other locales, the sorting sequence is not specified, and
> '[a-d]' might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail
> to match any character, or the set of characters that it matches might even
> be erratic. To obtain the traditional interpretation of bracket expressions,
> you can use the 'C' locale by setting the 'LC_ALL' environment variable to
> the value 'C'.
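The C-locale case described there can be checked with a small sketch.
The helper name c_locale_matches is made up for illustration, and -o is
a GNU grep option rather than a POSIX one. It only demonstrates the
ASCII example from the manual and says nothing about how '[a-ž]' behaves
in lv_LV.UTF-8.

```shell
# Hypothetical helper: print each maximal run of characters in "$1"
# that matches '[a-d]' when the C locale is forced via LC_ALL.
c_locale_matches() {
  printf '%s\n' "$1" | LC_ALL=C grep -o '[a-d]*'
}

# In the C locale '[a-d]' is exactly '[abcd]', so the uppercase
# letters are skipped and each lowercase letter is its own match.
c_locale_matches 'aBbCcDd'
# prints:
# a
# b
# c
# d
```

Setting LC_ALL rather than LC_COLLATE alone matters here, because
LC_ALL overrides every other locale category that may be set in the
environment.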
For the record, the POSIX rationale says this
(http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html):
> Range expressions are, historically, an integral part of REs. However, the
> requirements of "natural language behavior" and portability do conflict. In
> the POSIX locale, ranges must be treated according to the collating sequence
> and include such characters that fall within the range based on that
> collating sequence, regardless of character values. In other locales, ranges
> have unspecified behavior.
I am marking the auto-created issue as "not-a-bug", and can't even
(reasonably) label it as "wishlist", because allowing what your usage
implies is fundamentally contradictory.
You're welcome to continue the discussion here.