bug-gnu-emacs

bug#6283: doc/lispref/searching.texi reference to octal code `0377' corr


From: MON KEY
Subject: bug#6283: doc/lispref/searching.texi reference to octal code `0377' correct?
Date: Wed, 2 Jun 2010 15:41:38 -0400

As this bug seems closed, I'm replying in reverse order for the sake of
brevity and for others' future perusal.

On Tue, Jun 1, 2010 at 2:38 PM, Eli Zaretskii <address@hidden> wrote:
> I hope my answers make this issue more clear.

Yes, thank you. I appreciate that you've been generous in sharing your
time to help make this distinction clearer.

> (Did I say that use of raw bytes is complicated and full of subtleties?)

Indeed. It is definitely something I've personally had trouble grasping.
Thanks again.

>> I am unable to locate the character: ÿ (255, #o377, #xff) e.g.
>> LATIN SMALL LETTER Y WITH DIAERESIS

> Sounds like a bug to me --- not in the conventions used by the
> manual, but rather in regexp search in Emacs.  Feel free to file a
> separate bug about that.

Given my current trepidations I'm not sure how to characterize the bug
(if any) nor if I am the right person to do so.

Are you able to reproduce this behaviour?

Feel free to reply to the rest of this mail in private should you be
so inclined:

> Because ÿ is a character, whereas `(multibyte-char-to-unibyte 4194303)'
> is a raw byte.

So, would it be reasonable of me to characterize the mechanism of
Emacs regexps as (conceptually) searching over an in memory numeric
representation of character codepoints where a given character has a
numeric value (regardless of the radix notation used to represent it)
which falls within the numerical range of 22-bit numbers represented
by the set of integers encompassed by the return value of (max-char)?

IOW (search-forward-regexp "ÿÿÿ") doesn't match three `ÿ's so much as
it attempts to match against whatever in-memory representation Emacs
currently has for the current buffer's character set by moving across
an array of integers (which correspond to the buffer's numeric character
values) looking for a particular sequence of integer value(s). That is,
we aren't matching the character represented by a respective codepoint
but rather the integer value which maps to that character's respective
codepoint according to the current buffer's coding system.

Which is to say in a buffer having the `buffer-file-coding-system'
value utf-8-unix and which contains the characters: "set of ÿÿÿ chars"
the regexp:

 (search-forward-regexp "ÿÿÿ")

is (conceptually) equivalent to searching across this array:

 [115 101 116 32 111 102 32 255 255 255 32 99 104 97 114 115]

for the sequence of consecutive adjacent integers with the value 255.
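
A minimal sketch of how one might check that picture from Lisp. It
only inspects a string (not the buffer's actual internal
representation), and it assumes the string literal is read as a
multibyte string in which `ÿ' is the character 255:

```elisp
;; `vconcat' turns a string into a vector of its character codes.
(vconcat "set of ÿÿÿ chars")
;; => [115 101 116 32 111 102 32 255 255 255 32 99 104 97 114 115]

;; The same codes, counted: three adjacent elements equal 255.
(length (delq nil (mapcar (lambda (c) (= c 255))
                          "set of ÿÿÿ chars")))
;; => 3
```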

And, that were this a search for three consecutive raw-byte
characters with the multibyte numeric value 4194303, the regexp:

 (search-forward-regexp "\377\377\377")

is (conceptually) equivalent to searching across this array:

 [115 101 116 32 111 102 32 4194303
  4194303 4194303 32 99 104 97 114 115]

for three consecutive adjacent integers with the value 4194303.
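
A hedged sketch of the byte/character distinction as seen from Lisp,
assuming Emacs 23 or later. It works on strings only and says nothing
about how the regexp engine itself converts a unibyte pattern:

```elisp
;; "\377\377\377" is read as a unibyte string of three bytes, each 255.
(multibyte-string-p "\377\377\377")
;; => nil
(vconcat "\377\377\377")
;; => [255 255 255]

;; Converted to multibyte, each byte >= #x80 becomes a raw-byte character.
(vconcat (string-to-multibyte "\377\377\377"))
;; => [4194303 4194303 4194303]

;; The two single-character conversions are inverses for raw bytes.
(unibyte-char-to-multibyte 255)      ; => 4194303
(multibyte-char-to-unibyte 4194303)  ; => 255
```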

With this latter integer (4194303), it so happens, being the decimal
value representing the uppermost of Emacs' internal `codespace'.
Where this `codespace' is understood as the range of the set of
characters which may be represented by the positive numerical range of
the 22-bit number corresponding to the integer return value of
`max-char', e.g.:

 (max-char) => 4194303 (#o17777777, #x3fffff)

Such that `max-char's numerical value (and lesser positive values
thereof) may be presented to the Emacs Lisp reader in various ways
including -- and in addition to decimal (base 10) notation -- those
integer values represented with the reader syntax:

  #<radix>N and #<R>rN

in any number of radixes including 10, 8, 16, and 2 as follows:

 decimal value     4194303    or #10r4194303

 octal value       #o17777777 or #8r17777777

 hexadecimal value #x3fffff   or #16r3fffff

 binary value      #b01111111111111111111111
                or #2r01111111111111111111111
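
As a quick sanity check (a sketch, nothing more), all of the spellings
above can be handed to the reader at once. `delete-dups' collapses
them to a single element if they really denote the same integer:

```elisp
(delete-dups
 (list 4194303 #10r4194303
       #o17777777 #8r17777777
       #x3fffff #16r3fffff
       #b01111111111111111111111 #2r01111111111111111111111
       (max-char)))
;; => (4194303)
```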

Where this particular numeric value is more widely understood as:
raw-byte 255

This `raw-byte' being understood more generally as the uppermost in the
so-called `octal range': 0200-0377

With the `octal range' being otherwise represented within the Emacs
codespace at its upper bounds as the final range of 128 numeric
character values beginning from the code offset 4194176
(inclusive). Such that the range of raw-bytes 128-255 begins with
the codespace's integer value 4194176 and extends to 4194303, e.g.:

 (cons 4194176 (+ 4194176 (- 255 128)))

And this range may more generally be represented in Emacs as:

raw-byte range:            0x80 - 0xFF

decimal range:             4194176 - 4194303

octal range:               #o17777600 - #o17777777

hexadecimal range:         #x3fff80   - #x3fffff

binary range:              #b01111111111111110000000 - #b01111111111111111111111
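
The bounds above can be recomputed rather than taken on faith. A
minimal sketch, assuming Emacs 23 or later; the variable names `lo'
and `hi' are only for illustration:

```elisp
(let ((lo (unibyte-char-to-multibyte #x80))   ; first raw-byte character
      (hi (unibyte-char-to-multibyte #xff)))  ; last raw-byte character
  (list lo hi
        (1+ (- hi lo))                        ; size of the range, inclusive
        (format "#o%o - #o%o" lo hi)
        (format "#x%x - #x%x" lo hi)))
;; => (4194176 4194303 128 "#o17777600 - #o17777777" "#x3fff80 - #x3fffff")
```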

--
/s_P\




