bug-grep
Re: [bug-grep] Re: grep: -i option not working i cronjobs


From: Elliott Hughes
Subject: Re: [bug-grep] Re: grep: -i option not working i cronjobs
Date: Sun, 14 Nov 2004 10:20:24 -0800

i think the kind of thing Aharon was thinking of was something like the character ß (Latin small letter sharp s) in de_DE (as opposed to de_CH), which has no single-character upper-case form: its upper-case equivalent is the two *characters* "SS".

Java understands the conversion (toUpperCase() gives "SS"), but its case-insensitive regex matching doesn't think "ß" and "SS" match. i'm not German, but that seems wrong to me. (but then, given Swiss German, i'd probably always want "ss" and "\u00df" to match in free-text search applications.)

$ cat > t.java
public class t {
 public static void main(String[] args) {
  String latinSmallLetterSharpS = "\u00df";
  System.out.println(latinSmallLetterSharpS);
  System.out.println(latinSmallLetterSharpS.toUpperCase());

  System.out.println("schliesslich".matches("(?i)SCHLIESSLICH"));
  System.out.println("schlie\u00dflich".matches("(?i)SCHLIESSLICH"));
  System.out.println("schlie\u00dflich".matches("(?i)SCHLIE\u00dfLICH")); // You wouldn't write this.
 }
}
$ javac t.java && java t
ß
SS
true
false
true
$
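
One way to get "ss" and "\u00df" to compare equal, as wished for above, is to fold both strings before comparing them. The following is only a minimal sketch: the class and method names (LooseMatch, fold, equalsLoosely) are made up for illustration, and upper-casing followed by lower-casing is an approximation of Unicode full case folding, not an implementation of it. Locale.ROOT is used so the result does not depend on the user's locale.

```java
import java.util.Locale;

class LooseMatch {
    // Hypothetical helper: approximate full case folding by upper-casing
    // (which expands "\u00df" to "SS") and then lower-casing the result.
    static String fold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Two strings match loosely when their folded forms are identical.
    static boolean equalsLoosely(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        System.out.println(equalsLoosely("schlie\u00dflich", "SCHLIESSLICH")); // true
        System.out.println(equalsLoosely("schlie\u00dflich", "schliesslich")); // true
        System.out.println(equalsLoosely("schlie\u00dflich", "schlieslich"));  // false
    }
}
```

This compares whole strings only. A search tool would have to apply the same folding to both the pattern and the text, which changes string lengths and therefore match offsets.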

--
http://www.jessies.org/~enh/

On Nov 14, 2004, at 04:09, Aharon Robbins wrote:

Careful here.  As I just recently learned, there are languages where
a lower case character is one byte and the upper case equivalent is a
multibyte character.  (Or vice versa, I don't remember.)  Thus, the
'a' -> '[aA]' solution is fine for ASCII, but doesn't generalize for other
character sets.  Or at least not simply.

Having a single-byte character and a multi-byte character in the same
character class works fine here in UTF-8.  Why do you think there
would be problems with this approach?

Tim.
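
As a concrete illustration of the byte-length mismatch being discussed, the sketch below prints the UTF-8 length of a few code points next to the length of their simple upper- and lower-case mappings. The sample characters (dotless i U+0131 and the Kelvin sign U+212A) are chosen here for illustration and are not taken from the thread; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

class CaseBytes {
    // Number of bytes a single code point occupies when encoded as UTF-8.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Dotless i, Kelvin sign, and plain ASCII 'a' for comparison.
        int[] samples = {0x0131, 0x212A, 'a'};
        for (int cp : samples) {
            int upper = Character.toUpperCase(cp);
            int lower = Character.toLowerCase(cp);
            System.out.printf("U+%04X: %d byte(s); upper U+%04X: %d byte(s); lower U+%04X: %d byte(s)%n",
                    cp, utf8Length(cp), upper, utf8Length(upper), lower, utf8Length(lower));
        }
    }
}
```

In UTF-8, dotless i takes two bytes while its upper-case form is the one-byte ASCII 'I', and the Kelvin sign takes three bytes while its lower-case form is the one-byte 'k'.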

I don't know if there would be problems or if there wouldn't be, but
the code doing this can't be naive and just do

        if (ignoring_case) {
                /* naive: assumes c and toupper(c) are each a single byte */
                buffer[i++] = '[';
                buffer[i++] = c;
                buffer[i++] = toupper(c);
                buffer[i++] = ']';
        }

It has to be somewhat smarter.  Also, UTF-8 isn't the only multibyte
encoding that GLIBC and thus GNU can handle...
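
For comparison, here is a rough sketch of what "somewhat smarter" could look like, written in Java rather than C so that it lines up with the Java test earlier in the thread. It is not how grep or gawk do it. All names are hypothetical. It works on code points instead of bytes, and it emits an alternation rather than a bracket expression so that a multi-character expansion such as "SS" can be one of the alternatives.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class CaseFragment {
    // Join alternatives as (?:a|b|...); a single alternative is returned as-is.
    static String alternation(Set<String> alts) {
        if (alts.size() == 1) return alts.iterator().next();
        return "(?:" + String.join("|", alts) + ")";
    }

    // Variants reachable through one-to-one case mappings only.
    static String simpleFragment(int cp) {
        Set<String> alts = new LinkedHashSet<>();
        for (int v : new int[] {cp, Character.toLowerCase(cp), Character.toUpperCase(cp)}) {
            alts.add(Pattern.quote(new String(Character.toChars(v))));
        }
        return alternation(alts);
    }

    // Adds multi-character expansions (e.g. sharp s -> "SS") as extra alternatives.
    static String fragmentFor(int cp) {
        String s = new String(Character.toChars(cp));
        Set<String> alts = new LinkedHashSet<>();
        alts.add(simpleFragment(cp));
        for (String full : new String[] {s.toLowerCase(Locale.ROOT), s.toUpperCase(Locale.ROOT)}) {
            if (full.codePointCount(0, full.length()) > 1) {
                StringBuilder seq = new StringBuilder();
                full.codePoints().forEach(c -> seq.append(simpleFragment(c)));
                alts.add(seq.toString());
            }
        }
        return alternation(alts);
    }

    // Turn a literal string into a regex that ignores case.
    static String caseInsensitive(String literal) {
        StringBuilder sb = new StringBuilder();
        literal.codePoints().forEach(cp -> sb.append(fragmentFor(cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        String regex = caseInsensitive("schlie\u00dflich");
        System.out.println(Pattern.matches(regex, "SCHLIESSLICH")); // true
        System.out.println(Pattern.matches(regex, "schliesslich")); // true
    }
}
```

The sketch only expands in one direction: a pattern containing "\u00df" will match "SS" in the text, but a pattern containing "SS" will not match "\u00df". Handling the reverse direction needs a table of which sequences fold to a single character, which is part of why the problem does not generalize simply.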

I'm a parochial American and thus find all the multibyte stuff to
be a pain, but that's just me personally. :-)  Gawk still isn't
really multibyte aware.  For example, the length() function returns
bytes, not characters, and I have no idea as to whether index()
really works correctly in multibyte characters.  Similar for substr().
(If anyone here is a guru and wants to help out with these things,
let me know! :-)
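
The bytes-versus-characters distinction behind the length() remark can be shown with a small sketch. It is in Java, to match the test program earlier in the thread, and says nothing about gawk's internals; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Lengths {
    // Length in bytes of the UTF-8 encoding of s.
    static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in characters (Unicode code points) of s.
    static int charLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String word = "schlie\u00dflich";
        System.out.println(byteLength(word)); // 12
        System.out.println(charLength(word)); // 11
    }
}
```

The two counts differ by one because the sharp s takes two bytes in UTF-8. Any index() or substr() that counts bytes would be off by the same amount for positions after that character.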

Arnold






