bug#16631: Consideration of title case on case-insensitive matching
From: Norihiro Tanaka
Subject: bug#16631: Consideration of title case on case-insensitive matching
Date: Sat, 08 Feb 2014 01:49:47 +0900
Paul Eggert wrote:
> 1. It doesn't solve the problem from the ordinary user's point of view.
> For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ǈ" will still
> output nothing, because the one-character pattern "ǈ" does not match
> the two-character string "lj" even when the latter's two-letter case
> variants "Lj", "lJ", "LJ" are considered.
>
> 2. The characters in question are present in Unicode only for
> compatibility with previous standards; they're not intended to be used
> in new text. So this is a problem of the past, one that has mostly died
> out already.
>
> 3. Because of (2) the characters in question are rare, even in the
> languages where one might naively think they're useful. For example,
> the Croatian Wikipedia page for Ljubljana
> <http://hr.wikipedia.org/wiki/Ljubljana>
> consistently uses the two-character forms "Lj" and "lj", not the
> one-character forms "ǈ" and "ǉ".
>
> 4. The solution doesn't generalize to similar problems in more-complicated
> orthographies. For example, in polytonic Greek when ignoring case
> ordinary users would expect "ᾄ" (U+1F84) to match not only "ᾌ" (U+1F8C),
> but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two characters) and "Αι"
> (U+0391, U+03B9). Worse, this depends on context: often "?" should
> not match "??" when ignoring case. For details on this, please see
> Nick Nicholas's discussion "Titlecase and Adscripts"
> <http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.
>
> I think the underlying problem is that glibc doesn't define a conversion
> between the two-character string "lj" and the single-character Lj, or
> between "ᾌ" (U+1F8C) and "Α" (U+0391), etc.
For example, grep on HP-UX, which appears to be quite compliant with POSIX,
supports conversion between the single-character "lj" and the single-character
"Lj", though it doesn't support the conversions mentioned above.
I believe the conversion rules are required to comply with the locale data
of libc. It appears that the conversion between "Lj", "lJ" and "LJ"
is defined in the UTF-8 locale data, but not between U+1F84 and U+0391 etc.
> 5. When POSIX specifies how to match a regular expression while ignoring
> case, it talks only about "uppercase or lowercase"
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>.
>
> If we change 'grep' along the lines being suggested, we'll either have
> to change POSIX, or have the change take effect only if POSIXLY_CORRECT
> is not set.
The upper case of the single-character "Lj" is "LJ" and the lower case is "lj".
These conversions are also supported by the towupper and towlower functions.
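
To illustrate the three-way mapping, here is a minimal sketch in Python.
It uses Python's built-in Unicode case mappings rather than libc's towupper
and towlower, so it only approximates what a given locale would return.
The helper names case_variants and matches_ignoring_case are hypothetical,
and the code points U+01C7, U+01C8 and U+01C9 are assumed to be the
single-character "LJ", "Lj" and "lj" under discussion.

```python
# U+01C7 (LJ), U+01C8 (Lj) and U+01C9 (lj): upper, title and lower case
# forms of one single-character digraph.
UPPER, TITLE, LOWER = "\u01c7", "\u01c8", "\u01c9"


def case_variants(ch):
    """Return ch together with its upper, lower and title case mappings."""
    return {ch, ch.upper(), ch.lower(), ch.title()}


def matches_ignoring_case(pattern_ch, text_ch):
    """True if text_ch is one of the case variants of pattern_ch."""
    return text_ch in case_variants(pattern_ch)


if __name__ == "__main__":
    # The title case form maps up to U+01C7 and down to U+01C9.
    print([hex(ord(c)) for c in sorted(case_variants(TITLE))])
    # The two-character ASCII string "lj" is not among the variants.
    print(matches_ignoring_case(TITLE, "lj"))
```

Considering the title case mapping in addition to upper and lower case is
what lets each of the three single-character forms match the other two.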
Aharon Robbins wrote:
> This is an issue for gawk.
It seems I misunderstood. The problem doesn't reproduce on
grep-2.16. It was introduced by the patch for bug#16421, which removes
the GREP-oriented parts of dfa.c.