Re: Case insensitivity seems to ignore lower bound of interval
From: Eric Bischoff
Subject: Re: Case insensitivity seems to ignore lower bound of interval
Date: Wed, 4 May 2011 09:43:45 +0200
User-agent: KMail/1.13.6 (Linux/2.6.38-8-generic; KDE/4.6.2; x86_64; ; )
On Friday 29 April 2011 09:55:23, Aharon Robbins wrote:
> Davide Brini states:
> > You seem to think this is gawk-specific, but in fact any locale-aware
> > tool that uses regular expressions behaves the same (try eg with sed or
> > grep).
>
> And this too is correct.
It isn't. See the results of the tests that followed: sed, awk and grep simply
don't behave the same, at least in the versions shipped with distributions.
> POSIX locales (in my not-so-humble opinion) are
> a total and utter botch.
On that we all agree ;-).
> [[:lower:]], [[:upper:]] and so on exist to mitigate this issue. They are
> not perfect solutions.
Yes.
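To make the difference concrete, here is a minimal Python sketch of why a
range such as [a-z] can depend on the locale while a character class does
not. The collation order "aAbB...zZ" is an assumption made up for
illustration; real locale collation tables are more complex, and this does
not reproduce what gawk, sed or grep actually do internally.

```python
import string

# Hypothetical dictionary-style collation order: aAbBcC...zZ.
# This is an illustrative assumption, not a real locale table.
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range_codepoint(ch, low, high):
    """Range test by code point, which is what the C locale amounts to."""
    return ord(low) <= ord(ch) <= ord(high)


def in_range_collation(ch, low, high):
    """Range test by position in the hypothetical collation order."""
    return COLLATION.index(low) <= COLLATION.index(ch) <= COLLATION.index(high)


print(in_range_codepoint("B", "a", "z"))  # False: 'B' comes before 'a'
print(in_range_collation("B", "a", "z"))  # True: 'B' sorts between 'a' and 'z'
print(in_range_collation("Z", "a", "z"))  # False: 'Z' sorts after 'z'
print("B".islower())                      # False: a class test ignores ordering
```

Under such an ordering, [a-z] picks up every uppercase letter except 'Z',
whereas a class test in the spirit of [[:lower:]] gives the same answer
whatever the ordering is.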
> > Collation [...]
>
> Collation has to do with sorting order, and less so with regular expression
> matching. Gawk doesn't support [[=e=]] which is supposed to match all
> versions of the letter 'e'.
OK, did not know.
> I agree, which is why I've clarified the doc and changed the code, but
> again, this is not a gawk-specific issue but a general locale issue.
Have the library writers been contacted? Since the problems seem to lie
there, wouldn't that be the logical thing to do?
> > One technical possibility would be to simply use Unicode code positions.
>
> Unfortunately, no. Gawk is used in many parts of the world where Unicode
> is not the standard character set (Japan, China, etc.)
I was suggesting converting internally to Unicode from other character sets
before doing anything else. I'm not sure this is a good idea in the case of
awk, though. But working internally in Unicode is a common technique.
Also, Unicode is becoming standard everywhere in the world, replacing all
older encodings. That includes China and Japan.
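As a rough illustration of that technique, here is a minimal Python sketch:
input in any legacy encoding is decoded to Unicode at the boundary, all
processing happens on Unicode text, and the result is encoded back on
output. The function names decode_input and encode_output are hypothetical
and chosen for this example only; this is not how gawk is implemented.

```python
def decode_input(raw: bytes, encoding: str) -> str:
    """Convert bytes in a legacy encoding to Unicode text (hypothetical helper)."""
    return raw.decode(encoding)


def encode_output(text: str, encoding: str) -> bytes:
    """Convert Unicode text back to the caller's encoding (hypothetical helper)."""
    return text.encode(encoding)


# The same Japanese word as it would arrive in two different legacy encodings.
euc_bytes = "日本語".encode("euc_jp")
sjis_bytes = "日本語".encode("shift_jis")

# The byte sequences differ, but once decoded both are the same Unicode text,
# so any matching or comparison logic only has to deal with one representation.
print(euc_bytes == sjis_bytes)
print(decode_input(euc_bytes, "euc_jp") == decode_input(sjis_bytes, "shift_jis"))

# Output is converted back to whatever encoding the user expects.
print(encode_output(decode_input(euc_bytes, "euc_jp"), "shift_jis") == sjis_bytes)
```

The cost of this design is a conversion step on every input and output,
which may or may not be acceptable for a tool like awk.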
> and restricting gawk to just Unicode would not be a good idea.
That was not what I suggested. Sorry if I wasn't clear.
> If you still disagree, then I'm sorry, there's nothing else I can do
> to help.
I'm sorry I did not understand your initial message in the first place, which
said it was already solved. Please accept my apologies for that.
--
Éric Bischoff - Bureau Cornavin
Technical writing and translations
http://www.bureau-cornavin.com
(+33) 3 68 46 00 85
sip:address@hidden