bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
From: YAMAMOTO Mitsuharu
Subject: bug#3687: 23.1.50; inconsistency in multibyte eight-bit regexps
Date: Fri, 24 Jul 2009 10:08:11 +0900
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.8 (Shijō) APEL/10.6 Emacs/22.3 (sparc-sun-solaris2.8) MULE/5.0 (SAKAKI)
>>>>> On Mon, 29 Jun 2009 10:47:30 +0200, Stefan Monnier
>>>>> <monnier@iro.umontreal.ca> said:
>> It seemed to be too obvious to explain and I hesitated to do that.
>> Anyway, I assume "C" and "[C]" work equivalently as regexps if the
>> character C has no special meaning in either context.
> Yes, it's pretty obvious, thank you. I haven't had time to look
> deeper, but that part of the code is pretty nasty because it tries
> to be clever about the fact that values between 128-256 can be
> either latin-1 chars and eight-bit-bytes and it tries to be lenient
> about confusion between the two.
Are there any written specifications explaining how the leniency is
supposed to work?
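
A minimal sketch of the "C" versus "[C]" comparison assumed in the quoted text above. The helper name `my-compare-char-regexps' is hypothetical and not part of Emacs, and it assumes CHAR has no special meaning inside a bracket expression:

```elisp
;; Hypothetical helper (not part of Emacs): compare how a character
;; matches as a bare regexp and as a one-element bracket expression.
;; Assumes CHAR is not special inside brackets (not `]', `^' or `-').
(defun my-compare-char-regexps (char string)
  "Return (BARE . BRACKETED) match positions of CHAR in STRING."
  (let ((bare (regexp-quote (string char)))
        (bracketed (concat "[" (string char) "]")))
    (cons (string-match bare string)
          (string-match bracketed string))))

;; ASCII case, where the two forms agree:
(my-compare-char-regexps ?a "banana")   ; => (1 . 1)
```

Calling the helper with a character in the 128-255 range, against both unibyte and multibyte strings, is where a difference between the two forms would show up. No results are asserted for those cases here.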
As for documentation, the description below in the Elisp Info manual
(Special Characters in Regular Expressions) probably needs to be
updated.

     The beginning and end of a range of multibyte characters must be in
     the same character set (*note Character Sets::). Thus,
     `"[\x8e0-\x97c]"' is invalid because character 0x8e0 (`a' with
     grave accent) is in the Emacs character set for Latin-1 but the
     character 0x97c (`u' with diaeresis) is in the Emacs character set
     for Latin-2. (We use Lisp string syntax to write that example,
     and a few others in the next few paragraphs, in order to include
     hex escape sequences in them.)

     If a range starts with a unibyte character C and ends with a
     multibyte character C2, the range is divided into two parts: one
     is `C..?\377', the other is `C1..C2', where C1 is the first
     character of the charset to which C2 belongs.

     You cannot always match all non-ASCII characters with the regular
     expression `"[\200-\377]"'. This works when searching a unibyte
     buffer or string (*note Text Representations::), but not in a
     multibyte buffer or string, because many non-ASCII characters have
     codes above octal 0377. However, the regular expression
     `"[^\000-\177]"' does match all non-ASCII characters (see below
     regarding `^'), in both multibyte and unibyte representations,
     because only the ASCII characters are excluded.

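
The last paragraph of the manual text above can be tried with a short sketch like the following. The sample strings are assumptions chosen for illustration: one multibyte string containing a non-ASCII character, and one unibyte string containing the raw byte octal 351.

```elisp
;; Sketch: apply the negated ASCII range from the manual text to a
;; multibyte and a unibyte string.  Sample strings are illustrative.
(let ((re "[^\000-\177]")           ; anything outside the ASCII range
      (multibyte "caf\u00e9")       ; multibyte string ending in e-acute
      (unibyte "caf\351"))          ; unibyte string ending in byte 0351
  (list (multibyte-string-p multibyte)    ; confirm the representations
        (multibyte-string-p unibyte)
        (string-match re multibyte)       ; per the manual: should match
        (string-match re unibyte)))       ; per the manual: should match
```

According to the manual text, both `string-match' calls should return a non-nil position, since only the ASCII characters are excluded from the set.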
YAMAMOTO Mitsuharu
mituharu@math.s.chiba-u.ac.jp