Re: Unibyte characters, strings, and buffers

emacs-devel

From:	David Kastrup
Subject:	Re: Unibyte characters, strings, and buffers
Date:	Sat, 29 Mar 2014 18:16:39 +0100
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

Nathan Trapuzzano <address@hidden> writes:

> "Stephen J. Turnbull" <address@hidden> writes:
>
>> What is relevant is how to represent byte streams in Emacs.  The
>> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
>> characters.  It is *extremely* convenient if the first 128 of those
>> bytes correspond to the ASCII coded character set, because so many
>> wire protocols use ASCII "words" syntactically.  The other 128 don't
>> matter much, so why not just use the extremely convenient Latin-1 set
>> for them?
>
> Sorry if someone brought this up already, but one reason raw bytes
> shouldn't be represented as Latin-1 characters is that the "raw
> bytes"-ness would be lost when writing them back to disk if the stream
> also contained characters outside the Latin-1 range.

No.

> For example, say we decode a stream of raw bytes as utf8, but that the
> stream contains some non-utf8 sequences.  IIUC, Emacs will interpret
> those as "raw bytes", so that when it goes to encode the string to write
> it back, they will be written back verbatim.

"Raw bytes" here are represented as particular characters outside of the
Unicode range.  They are representable in multibyte buffers.  They never
were representable in unibyte buffers.  While it is conceivable to map
characters 128..255 in unibyte strings/buffers to the respective
character codes outside of the Unicode range, that would render
programmatic manipulation of bytes strenuous.
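
As a rough illustration of the idea (an analogy only, not how Emacs implements it): Python's "surrogateescape" error handler also gives each undecodable byte its own stand-in code point, one that cannot be confused with a Latin-1 character. Emacs uses codes above the Unicode range for this purpose, while Python uses lone surrogates, so the concrete values below are specific to Python. The sample bytes and variable names are made up for the example.

```python
# Analogy only: Python's "surrogateescape" handler, not Emacs internals.
# Valid UTF-8 for "café", followed by a stray byte 0xFF that is not valid UTF-8.
data = b"caf\xc3\xa9\xff"

text = data.decode("utf-8", errors="surrogateescape")

# The stray byte becomes a dedicated stand-in code point ...
raw_marker = text[-1]
# ... which differs from the Latin-1 character with the same byte value.
latin1_char = b"\xff".decode("latin-1")

print(hex(ord(raw_marker)))       # stand-in for raw byte 0xFF
print(hex(ord(latin1_char)))      # the real character U+00FF
print(raw_marker == latin1_char)  # the two are distinct characters
```

The point of the sketch is only that "raw byte 0xFF" and "character U+00FF" are different characters in such a scheme, so mixing raw bytes with text outside the Latin-1 range loses nothing.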

> Whereas, if they had been interpreted as Latin-1 characters, they
> would get written back as the UTF8 equivalents.  Hence you have the
> odd situation where you can decode and then encode and end up with a
> different string.

No, you can't, unless you decode into a unibyte buffer, and then all bets
are off regarding reencoding.
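
The round trip can be sketched in Python, again purely as an analogy under the assumption that undecodable bytes get dedicated stand-in characters (Python's "surrogateescape" handler plays that role here; it is not Emacs's mechanism). The second half shows the hypothetical scheme being worried about, where a stray byte is turned into a Latin-1 character instead. All names and sample bytes are invented for the example.

```python
# Analogy only: Python's "surrogateescape" handler, not Emacs internals.
# "é" encoded as UTF-8, then a byte that is not valid UTF-8.
data = b"\xc3\xa9\xff"

# Escape-style decoding: the stray byte gets its own stand-in character,
# so encoding again restores the original bytes.
escaped = data.decode("utf-8", errors="surrogateescape")
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

# Hypothetical alternative: treat the stray byte as the Latin-1 character
# U+00FF. Encoding that character as UTF-8 yields two bytes, not the original.
as_latin1 = "\u00e9" + b"\xff".decode("latin-1")
reencoded = as_latin1.encode("utf-8")

print(roundtrip == data)   # escape scheme preserves the byte stream
print(reencoded == data)   # Latin-1 reinterpretation does not
print(reencoded)
```

With dedicated stand-in characters the decode/encode pair is lossless; the mismatch appears only in the hypothetical Latin-1 variant.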

-- 
David Kastrup
