Re: Unibyte characters, strings, and buffers
From: David Kastrup
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 18:16:39 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

Nathan Trapuzzano <address@hidden> writes:
> "Stephen J. Turnbull" <address@hidden> writes:
>
>> What is relevant is how to represent byte streams in Emacs. The
>> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
>> characters. It is *extremely* convenient if the first 128 of those
>> bytes correspond to the ASCII coded character set, because so many
>> wire protocols use ASCII "words" syntactically. The other 128 don't
>> matter much, so why not just use the extremely convenient Latin-1 set
>> for them?
>
> Sorry if someone brought this up already, but one reason raw bytes
> shouldn't be represented as Latin-1 characters is that the "raw
> bytes"-ness would be lost when writing them back to disk if the stream
> also contained characters outside the Latin-1 range.

No.

> For example, say we decode a stream of raw bytes as utf8, but that the
> stream contains some non-utf8 sequences. IIUC, Emacs will interpret
> those as "raw bytes", so that when it goes to encode the string to write
> it back, they will be written back verbatim.

"Raw bytes" here are represented as particular characters outside of the
Unicode range. They are representable in multibyte buffers. They were
never representable in unibyte buffers. While it is conceivable to map
characters 128..255 in unibyte strings/buffers to the respective
character codes outside of the Unicode range, that would render
programmatic manipulation of bytes strenuous.

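
As an illustration of the point above, here is a minimal sketch in Python rather than Emacs Lisp. It is an analogy only: Python's `surrogateescape` error handler parks each undecodable byte on a reserved code point (U+DC80..U+DCFF), much as Emacs parks raw bytes on character codes beyond the Unicode range (the Emacs Lisp manual documents #x3FFF80..#x3FFFFF for this). The sample bytes and variable names are assumptions made up for the example.

```python
# Analogy only: this uses Python's "surrogateescape" handler, not Emacs internals.
# Valid UTF-8 for U+00E9, a separator, then a stray byte 0xE9 that is not valid UTF-8.
data = b"\xc3\xa9|\xe9|"

text = data.decode("utf-8", errors="surrogateescape")

latin1_char = text[0]  # the genuine character U+00E9
raw_byte = text[2]     # placeholder standing in for the undecodable byte 0xE9

print(hex(ord(latin1_char)))    # 0xe9
print(hex(ord(raw_byte)))       # 0xdce9
print(latin1_char == raw_byte)  # False
```

The character and the raw byte share the numeric value 0xE9 on the wire, yet they occupy different code points once decoded, so neither can be mistaken for the other.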
> Whereas, if they had been interpreted as Latin-1 characters, they
> would get written back as the UTF8 equivalents. Hence you have the
> odd situation where you can decode and then encode and end up with a
> different string.

No, you can't, unless you decode into a unibyte buffer, and then all bets
are off regarding re-encoding.

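
The round trip under discussion can be sketched as follows, again in Python as an analogy and not as a description of Emacs's coding-system code. The `latin1_fallback` handler is hypothetical: it implements the policy the quoted text worries about, where undecodable bytes become Latin-1 characters. The `surrogateescape` handler stands in for a representation that keeps raw bytes distinct.

```python
import codecs


def latin1_fallback(exc):
    """Hypothetical policy: turn each undecodable byte into a Latin-1 character."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which decoding resumes.
    return bad.decode("latin-1"), exc.end


codecs.register_error("latin1_fallback", latin1_fallback)

# Valid UTF-8 for U+00E9, a separator, then a stray byte 0xE9.
data = b"\xc3\xa9|\xe9|"

# Raw bytes kept distinct from characters: decode then encode is the identity.
distinct = data.decode("utf-8", errors="surrogateescape")
roundtrip_distinct = distinct.encode("utf-8", errors="surrogateescape")

# Raw bytes merged into Latin-1 characters: the stray byte is re-encoded as UTF-8.
merged = data.decode("utf-8", errors="latin1_fallback")
roundtrip_merged = merged.encode("utf-8")

print(roundtrip_distinct == data)  # True
print(roundtrip_merged == data)    # False
```

Only the merged variant changes the data on the way back out, which is why a representation that keeps raw bytes outside the ordinary character range does not suffer from the problem described in the quoted paragraph.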
--
David Kastrup