Careful here. As I just recently learned, there are languages where
a lower case character is one byte and the upper case equivalent is a
multibyte character. (Or vice versa, I don't remember.) Thus, the
'a' -> '[aA]' solution is fine for ASCII, but doesn't generalize to
other character sets. Or at least not simply.
Having a single-byte character and a multi-byte character in the same
character class works fine here in UTF-8. Why do you think there
would be problems with this approach?
Tim.
I don't know if there would be problems or if there wouldn't be, but
the code doing this can't be naive and just do
    if (ignoring case) {
        buffer[i++] = '[';
        buffer[i++] = c;
        buffer[i++] = toupper(c);
        buffer[i++] = ']';
    }
It has to be somewhat smarter. Also, UTF-8 isn't the only multibyte
encoding that GLIBC and thus GNU can handle...
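
A minimal sketch of what a "somewhat smarter" version might look like,
assuming the program has already called setlocale(LC_CTYPE, "") and
using only the standard C functions mbrtowc(), iswupper(), towupper(),
towlower() and wcrtomb(). The helper name case_class() and its interface
are hypothetical, not taken from gawk or any GNU tool. It decodes one
multibyte character, looks up the other case as a wide character, and
encodes that back, so it assumes neither that both forms have the same
byte length nor that the encoding is UTF-8.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: read one multibyte character from src (at most
 * n bytes) and write "[xX]" to out, where x is the character as given
 * and X is its other-case form in the current locale's encoding.
 * A character with no case counterpart is copied through unchanged.
 *
 * Returns the number of bytes consumed from src, or 0 on error or at
 * a NUL; *outlen receives the number of bytes written to out.
 * out must have room for 2 * MB_LEN_MAX + 2 bytes.
 */
size_t
case_class(const char *src, size_t n, char *out, size_t *outlen)
{
    mbstate_t st;
    wchar_t wc, other;
    size_t used, k, o = 0;

    *outlen = 0;
    memset(&st, 0, sizeof st);
    used = mbrtowc(&wc, src, n, &st);
    if (used == (size_t) -1 || used == (size_t) -2 || used == 0)
        return 0;               /* invalid, incomplete, or NUL */

    other = iswupper(wc) ? towlower(wc) : towupper(wc);
    if (other == wc) {          /* no case counterpart: plain copy */
        memcpy(out, src, used);
        *outlen = used;
        return used;
    }

    out[o++] = '[';
    memcpy(out + o, src, used); /* original bytes, however many */
    o += used;
    memset(&st, 0, sizeof st);
    k = wcrtomb(out + o, other, &st);
    if (k == (size_t) -1)
        return 0;               /* other case not encodable here */
    o += k;
    out[o++] = ']';
    *outlen = o;
    return used;
}
```

A caller would loop over the pattern, advancing by the return value and
appending *outlen bytes each time. This sketch ignores regex syntax
(characters already inside a bracket expression, backslash escapes) and
stateful encodings with shift sequences, both of which real code would
have to handle.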
I'm a parochial American and thus find all the multibyte stuff to
be a pain, but that's just me personally. :-) Gawk still isn't
really multibyte aware. For example, the length() function returns
bytes, not characters, and I have no idea as to whether index()
really works correctly with multibyte characters. Similarly for substr().
(If anyone here is a guru and wants to help out with these things,
let me know! :-)
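
For reference, the bytes-versus-characters distinction can be sketched
in C with the standard mbrlen() function. This is only an illustration
under the assumption that setlocale(LC_CTYPE, "") has been called; the
name char_length() is hypothetical and this is not how gawk implements
length().

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: count characters, not bytes, in the first n
 * bytes of s, using the current locale's multibyte encoding.
 * Returns (size_t) -1 if an invalid or incomplete sequence is found.
 */
size_t
char_length(const char *s, size_t n)
{
    mbstate_t st;
    size_t count = 0, k;

    memset(&st, 0, sizeof st);
    while (n > 0) {
        k = mbrlen(s, n, &st);
        if (k == (size_t) -1 || k == (size_t) -2)
            return (size_t) -1;
        if (k == 0)
            k = 1;      /* an embedded NUL counts as one character */
        s += k;
        n -= k;
        count++;
    }
    return count;
}
```

In a UTF-8 locale a two-byte character would add 1 to this count where
strlen() adds 2; index() and substr() would need the same kind of
character-by-character stepping to compute positions.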
Arnold