bug#33837: Unexpected result for regex with non-ascii range
From: Reinis Danne
Subject: bug#33837: Unexpected result for regex with non-ascii range
Date: Sat, 22 Dec 2018 21:43:46 +0200
Hi!
grep-3.3 and sed-4.6 seem to have fixed the issue with incorrect collation
of yY for the lv_LV.UTF-8 locale (by implementing rational range
interpretation?) [1].
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
Ž
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
a
āĀb
c
čČd
e
ēĒf
g
ģĢh
i
īĪy
j
k
ķĶl
ļĻm
n
ņŅo
ōŌp
q
r
ŗŖs
šŠt
u
ūŪv
w
x
z
žŽ
For the uppercase range the result is completely bogus: everything except
ž matches, including the lowercase letters. For the lowercase range, the
accented uppercase letters are interleaved with the lowercase ones.
I would expect all letters to have their uppercase variants de-interleaved here.
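One possible reading of both outputs, offered as a guess and not a confirmed diagnosis: they are what one would get if the range endpoints were compared by Unicode code point, which is what a rational range interpretation would imply. The sketch below does not call grep or glibc; it splits the test string into maximal runs of characters whose code point lies between the two endpoints. The helper name `runs_in_codepoint_range` is made up for illustration.

```python
import unicodedata
from itertools import groupby

# Test string from the report, normalized to precomposed (NFC) characters.
ALPHABET = unicodedata.normalize(
    "NFC",
    "aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ",
)


def runs_in_codepoint_range(first, last, text):
    """Return maximal runs of characters with ord(first) <= ord(ch) <= ord(last).

    Hypothetical helper: it mimics `grep -o '[first-last]*'` only under the
    assumption that the range is compared by code point.
    """
    lo, hi = ord(first), ord(last)
    runs = []
    for inside, group in groupby(text, key=lambda ch: lo <= ord(ch) <= hi):
        if inside:
            runs.append("".join(group))
    return runs


upper = runs_in_codepoint_range("A", "\u017d", ALPHABET)  # [A-Ž], U+0041..U+017D
lower = runs_in_codepoint_range("a", "\u017e", ALPHABET)  # [a-ž], U+0061..U+017E

print("\n".join(upper))
print("\n".join(lower))
```

With the string from the report, the first call yields two runs (everything up to zZ, then Ž) and the second yields 27 runs beginning a, āĀb, c, čČd, which is the shape of the grep output above. Under this reading, [A-Ž] also covers ASCII lowercase because a-z lie between U+0041 and U+017D, while [a-ž] skips ASCII uppercase but includes accented uppercase such as Ā (U+0100).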
I don't know whether grep alters the collation rules or whether this is done by glibc (2.28).
strxfrm() gives me this result:
Using LC_COLLATE=lv_LV.UTF-8
char strxfrm
i c2b7010201020101e29b96
I c2b7010201070101e2afb7
ī c2b70102140102020101e29bb7
Ī c2b70102140107020101e2b096
y c2b701030102
Y c2b701030107
j c382010201020101e29c96
J c382010201070101e2b0a4
Using LC_COLLATE=C.UTF-8
char strxfrm
i 6b
I 4b
ī c4ad
Ī c4ac
y 7b
Y 5b
j 6c
J 4c
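The program that produced the table above is not shown. As a minimal sketch of how such a table could be generated, assuming a Linux system where the C library's `strxfrm()` is reachable through `ctypes`, the following hex-dumps the transformed key for each character. The helper names `strxfrm_hex` and `dump` are illustrative, and the bytes printed depend on the installed glibc and locale data, so they need not match the values quoted here.

```python
import ctypes
import locale

# Handle to the C library already loaded into this process
# (assumption: a platform where CDLL(None) exposes libc, e.g. Linux).
_libc = ctypes.CDLL(None)
_libc.strxfrm.restype = ctypes.c_size_t
_libc.strxfrm.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t]


def strxfrm_hex(text):
    """Hex dump of strxfrm(text) under the current LC_COLLATE locale."""
    raw = text.encode("utf-8")
    # First call with a zero-size buffer only reports the required length.
    needed = _libc.strxfrm(None, raw, 0)
    buf = ctypes.create_string_buffer(needed + 1)
    _libc.strxfrm(buf, raw, needed + 1)
    return buf.raw[:needed].hex()


def dump(locale_name, chars="iI\u012b\u012ayYjJ"):
    """Print one 'char strxfrm' line per character for the given locale."""
    locale.setlocale(locale.LC_COLLATE, locale_name)
    print("Using LC_COLLATE=" + locale_name)
    print("char strxfrm")
    for ch in chars:
        print(ch, strxfrm_hex(ch))


if __name__ == "__main__":
    for name in ("lv_LV.UTF-8", "C.UTF-8"):
        try:
            dump(name)
        except locale.Error:
            # The locale is not installed on this machine.
            print("locale not available:", name)
```

Comparing the keys bytewise (as `strcmp()` would) gives the order that `strcoll()` uses in that locale, which makes it possible to check the collation order without involving grep.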
Reinis