bug-gnu-emacs
bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?


From: Eli Zaretskii
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Tue, 01 Jun 2010 21:38:41 +0300

> Date: Mon, 31 May 2010 20:24:00 -0400
> From: MON KEY <address@hidden>
> Cc: address@hidden
> 
> If I evaluate the following:
> 
>  (progn
>    (save-excursion
>      (insert-byte (multibyte-char-to-unibyte 4194221) 1)
>      (insert-byte (multibyte-char-to-unibyte 4194303) 1))
>    (search-forward-regexp "ÿ" nil t))
> 
> I don't match.

Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
is a raw byte.  Emacs can distinguish between the two because it
uses a special multibyte representation for raw bytes, which differs
from the representation of any Unicode character.  See this fragment from
the ELisp manual:

     Emacs defines several special character sets.  The character set
  `unicode' includes all the characters whose Emacs code points are in
  the range `0..#x10FFFF'.  The character set `emacs' includes all ASCII
  and non-ASCII characters.  Finally, the `eight-bit' charset includes
  the 8-bit raw bytes; Emacs uses it to represent raw bytes encountered
  in text.

and also this one:

     To support this multitude of characters and scripts, Emacs closely
  follows the "Unicode Standard".  The Unicode Standard assigns a unique
  number, called a "codepoint", to each and every character.  The range
  of codepoints defined by Unicode, or the Unicode "codespace", is
  `0..#x10FFFF' (in hexadecimal notation), inclusive.  Emacs extends this
  range with codepoints in the range `#x110000..#x3FFFFF', which it uses
  for representing characters that are not unified with Unicode and "raw
  8-bit bytes" that cannot be interpreted as characters.  Thus, a
  character codepoint in Emacs is a 22-bit integer number.
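
The numbers in the report can be related to these ranges with a few
evaluations.  This is only a sketch, assuming an Emacs that uses the
Unicode-based internal representation described above (23 or later);
the values in the comments are what such an Emacs is expected to return:

```elisp
;; Sketch: relate raw bytes to the code points Emacs uses for them.

;; The largest Emacs code point is #x3FFFFF, which needs 22 bits.
(max-char)                               ; 4194303, i.e. #x3FFFFF

;; Raw byte #xFF is represented by the last code point of that range.
(unibyte-char-to-multibyte #xFF)         ; 4194303
(multibyte-char-to-unibyte 4194303)      ; 255

;; The other value used in the report, 4194221 (#x3FFFAD), is raw byte #xAD.
(multibyte-char-to-unibyte 4194221)      ; 173

;; The raw-byte code point belongs to the `eight-bit' charset.
(char-charset 4194303)                   ; eight-bit
```

So 4194303 and 255 name two different things: the first is the
code point Emacs reserves for raw byte #xFF, the second is the Unicode
character ÿ.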

> Whereas if I evaluate:
> 
>  (progn
>    (save-excursion (insert 10 #o377))
>    (search-forward-regexp "ÿ" nil t))
> 
> I get a match.

Because `(insert 10 #o377)' inserts LATIN SMALL LETTER Y WITH
DIAERESIS, by design.
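
The difference can be seen by looking at what each call leaves in a
multibyte buffer.  The following is a minimal sketch;
`my-last-char-after' is a hypothetical helper introduced only for this
illustration, not an Emacs function:

```elisp
;; Sketch: compare what `insert' and `insert-byte' leave in a multibyte buffer.
;; `my-last-char-after' is a hypothetical helper, not part of Emacs.
(defun my-last-char-after (inserter)
  "Call INSERTER in a fresh multibyte temp buffer; return the last character."
  (with-temp-buffer
    (set-buffer-multibyte t)
    (funcall inserter)
    (char-before (point-max))))

;; `insert' treats the integer #o377 as a character code: U+00FF.
(my-last-char-after (lambda () (insert 10 #o377)))       ; 255

;; `insert-byte' inserts the raw byte, stored as an eight-bit character.
(my-last-char-after (lambda () (insert-byte #o377 1)))   ; 4194303
```

The same integer argument thus ends up as two different buffer
characters, depending on which function inserted it.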

> Likewise, if I evaluate
> 
>  (progn (save-excursion (insert 10 4194303))
>         (search-forward-regexp "\377" nil t))
> 
> I get a match.
> 
> Which is to say, given the example regexp from the manual, i.e:
> 
> ,----
> | You cannot always match all non-ASCII characters with the regular
> | expression `"[\200-\377]"'
> `----
> 
> I am unable to locate the character ÿ (255, #o377, #xff), i.e.
> LATIN SMALL LETTER Y WITH DIAERESIS

Sounds like a bug to me --- not in the conventions used by the
manual, but rather in regexp search in Emacs.  Feel free to file a
separate bug about that.

> To be clear, my issue isn't that I am unable to match `ÿ', but rather
> that I am able to match the raw-byte representation, whose visual
> appearance coincides with the octal value of the `ÿ' character code,
> i.e. #o377, otherwise widely understood as `octal 0377'.
> 
> I hope this is more clear than the previous mail. I apologize if it is not.

I hope my answers make this issue clearer.  (Did I say that use of
raw bytes is complicated and full of subtleties?)





