From: Kevin Rodgers
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Thu, 03 Jun 2010 08:39:44 -0600
User-agent: Thunderbird 2.0.0.24 (Macintosh/20100228)

MON KEY wrote:
>> Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
>> is a raw byte.
>
> So, would it be reasonable of me to characterize the mechanism of
> Emacs regexps as (conceptually) searching over an in-memory numeric
> representation of character codepoints, where a given character has a
> numeric value (regardless of the radix notation used to represent it)
> that falls within the 22-bit range bounded by the return value of
> (max-char)?

Sure.  But it doesn't make sense to me to even consider "the radix notation
used to represent it".  Characters are read, usually from buffers (including
the minibuffer), and the notation is only relevant with respect to the
buffer or keyboard coding system because each character is exactly that:
a character, represented internally as an integer.
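
For instance, a quick sketch (evaluated in Emacs 23 or later; the exact
numbers are what I get here) shows that characters really are just
integers:

  ;; Characters are integers in the range 0..(max-char):
  (max-char)            ; => 4194303, the top of the 22-bit range
  ?ÿ                    ; => 255, a character *is* its codepoint
  (char-to-string 255)  ; => "ÿ"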

> IOW (search-forward-regexp "ÿÿÿ") doesn't match three `ÿ's so much as
> it attempts to match against whatever in-memory representation Emacs
> currently has for the current buffer's character set, by moving across
> an array of integers (which correspond to the buffer's numeric character
> values) looking for a particular sequence of integer value(s). That is,
> we aren't matching the character represented by a respective codepoint
> but rather the integer value which maps to that character's respective
> codepoint according to the current buffer's coding system.

Why does the distinction between the codepoint and the representation matter,
since there is a 1:1 relationship between them?
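
A minimal illustration (using multibyte string literals, as in a normal
buffer) of the matcher comparing integer codes:

  ;; The regexp engine compares internal integer character codes:
  (aref "ÿ" 0)                  ; => 255, the code behind the character
  (string-match "ÿÿÿ" "xÿÿÿy")  ; => 1, a run of three chars with code 255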

I think that character sets and coding systems are irrelevant at this
point: the coding system was used to convert the text to the internal
representation when it was read into memory.  The only character set
that matters is Unicode, and the only codepoints that matter are
Unicode codepoints and Emacs' internal representation.

I just verified this, as follows: Unicode has the same codepoint →
character mappings as ASCII and ISO-8859-1, but ISO-8859-2 has different
characters than Unicode at some codepoints.  For example, codepoint #xA1
(octal 0241, decimal 161) is INVERTED EXCLAMATION MARK in Unicode but
LATIN CAPITAL LETTER A WITH OGONEK in ISO-8859-2.
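
The same comparison can be made non-interactively with
decode-coding-string (a sketch; "\241" is a unibyte string holding the
single byte #xA1):

  ;; One byte, two coding systems, two different internal characters:
  (decode-coding-string "\241" 'iso-8859-1)  ; => "¡" (U+00A1)
  (decode-coding-string "\241" 'iso-8859-2)  ; => "Ą" (U+0104)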

If I have a UTF-8 buffer and an ISO-8859-2 buffer, `M-: (ucs-insert
#x0104)' inserts the same character into both, as expected: LATIN
CAPITAL LETTER A WITH OGONEK.  The only difference in the output of
`C-u C-x =' is the file code -- the internal buffer codes are the same.
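
Here is a non-interactive version of that check (a sketch; the helper
my-internal-code is just something I made up for illustration):

  (defun my-internal-code (coding)
    "Insert U+0104 under CODING and return its internal character code."
    (with-temp-buffer
      (set-buffer-file-coding-system coding)
      (insert #x0104)
      (char-after (point-min))))

  (my-internal-code 'utf-8)       ; => 260 (#x104)
  (my-internal-code 'iso-8859-2)  ; => 260 (#x104), same internal code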

I thought that perhaps C-q 241 would insert different characters into
the two buffers, since their coding systems assign different characters
to that codepoint, but it doesn't: in both buffers it inserts INVERTED
EXCLAMATION MARK.

So it seems that Unicode is used regardless of buffer-file-coding-system.
Even `C-x RET c iso-8859-2 RET C-q 241' inserts INVERTED EXCLAMATION
MARK, not LATIN CAPITAL LETTER A WITH OGONEK.
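
Programmatically it is the same story: `insert' and `char-to-string'
take the internal (Unicode) codepoint, so the coding system never
enters into it:

  ;; Octal 241 = #xA1 = 161; the result is always U+00A1:
  (char-to-string #o241)  ; => "¡" (INVERTED EXCLAMATION MARK)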

Perhaps someone can explain how to insert a character using its numeric
codepoint in a specific character set?
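
The closest thing I have found so far is decode-char, which -- if I am
reading its documentation correctly -- maps a codepoint in a given
charset to the corresponding internal character (Emacs 23 and later):

  ;; #xA1 in iso-8859-2 is A WITH OGONEK, internal code 260 (#x104):
  (decode-char 'iso-8859-2 #xA1)           ; => 260
  (insert (decode-char 'iso-8859-2 #xA1))  ; inserts Ą

But maybe there is a more direct way.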

--
Kevin Rodgers
Denver, Colorado, USA
