From: | Paul Eggert |
Subject: | bug#16919: [PATCH] fix mismatch between dfa and regex for treatment of titlecase |
Date: | Wed, 05 Mar 2014 10:50:54 -0800 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 |
On 03/05/2014 07:11 AM, Norihiro Tanaka wrote:
> I still believe that upper or lower case of a character should also match title case
The (soon-to-be-fixed) gnulib regex code agrees with you, assuming that towupper (X) returns the same value for all three case forms X (upper, lower, and title), because it uses (towupper (input) == towupper (pattern)). However, the most plausible reading of POSIX does not agree with you, as it would require (input == pattern || towlower (input) == pattern || towupper (input) == pattern), which means a titlecase pattern will match only itself.
It seems pretty clear to me that the most-plausible reading of POSIX is buggy, for this reason. No wonder so many implementations fail to conform to it.
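To make the difference between the two comparisons concrete, here is a minimal sketch of both tests. The function names (match_regex_style, match_posix_reading) and the demo_upper/demo_lower helpers are hypothetical and are not taken from gnulib. The demo helpers hard-code the usual Unicode mappings for U+01C4 (upper), U+01C5 (title), and U+01C6 (lower) so that the result does not depend on the current locale; in real code one would pass towupper and towlower instead.

```c
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Case-mapping function type; towupper and towlower both fit it.  */
typedef wint_t (*case_map_fn) (wint_t);

/* The comparison the message attributes to the gnulib regex code.  */
static bool
match_regex_style (wint_t input, wint_t pattern, case_map_fn upper)
{
  return upper (input) == upper (pattern);
}

/* The comparison required by the most plausible reading of POSIX.  */
static bool
match_posix_reading (wint_t input, wint_t pattern,
                     case_map_fn lower, case_map_fn upper)
{
  return (input == pattern
          || lower (input) == pattern
          || upper (input) == pattern);
}

/* Hypothetical locale-independent stand-ins for towupper and towlower,
   covering only U+01C4 (upper), U+01C5 (title), U+01C6 (lower).  */
static wint_t
demo_upper (wint_t c)
{
  return (c == 0x01C5 || c == 0x01C6) ? 0x01C4 : c;
}

static wint_t
demo_lower (wint_t c)
{
  return (c == 0x01C4 || c == 0x01C5) ? 0x01C6 : c;
}
```

With these mappings, match_regex_style accepts any pairing of the three code points, whereas match_posix_reading with the titlecase pattern 0x01C5 accepts only the input 0x01C5.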
I thought of another way in which gnulib/glibc regex does not conform to POSIX, and here there doesn't seem to be any ambiguity about it. In the POSIX locale when ignoring case, the pattern '[Z-a]' matches the data 'Z', 'z', 'A', 'a', and the nonalphabetic characters like '^' that collate between 'Z' and 'a'. But the glibc regex code rejects that pattern entirely. Conversely, in the same situation the glibc regex code says '[A-z]' matches only alphabetic characters, whereas POSIX says it should also match the nonalphabetic characters like '^' that collate between 'Z' and 'a'. It appears that nobody cares, as this incompatibility has been present for years and I don't recall anyone complaining, though it is weird that this means "grep PAT" can match some lines that "grep -i PAT" doesn't.
Here POSIX is not merely ambiguous, it's clearly disagreeing with common practice. It's not clear whether the bug is in POSIX or in the implementation.
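For anyone who wants to check the bracket-expression behavior on their own system, a small probe along the following lines may help. This is only a sketch: icase_match is a hypothetical helper name, and it relies on the program still being in the default "C" (POSIX) locale, i.e. on setlocale not having been called.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN matches somewhere in TEXT when case is ignored,
   0 if it does not match (or regexec fails), and -1 if regcomp
   rejects PATTERN.  */
static int
icase_match (const char *pattern, const char *text)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_ICASE | REG_NOSUB) != 0)
    return -1;
  int rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

Going by the description above, one would expect icase_match ("[Z-a]", "^") to return -1 and icase_match ("[A-z]", "^") to return 0 with the glibc regex code, while the POSIX reading calls for 1 in both cases. Results may differ with other C libraries or versions.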