Re: Unibyte characters, strings, and buffers

From: Nathan Trapuzzano
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 13:01:17 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

"Stephen J. Turnbull" <address@hidden> writes:

> What is relevant is how to represent byte streams in Emacs.  The
> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
> characters.  It is *extremely* convenient if the first 128 of those
> bytes correspond to the ASCII coded character set, because so many
> wire protocols use ASCII "words" syntactically.  The other 128 don't
> matter much, so why not just use the extremely convenient Latin-1 set
> for them?

Sorry if someone brought this up already, but one reason raw bytes
shouldn't be represented as Latin-1 characters is that the "raw
bytes"-ness would be lost when writing them back to disk if the stream
also contained characters outside the Latin-1 range.

For example, say we decode a stream of raw bytes as UTF-8, but the
stream contains some sequences that are not valid UTF-8.  IIUC, Emacs
will interpret those as "raw bytes", so that when it goes to encode the
string to write it back, they will be written back verbatim.  If they
had instead been interpreted as Latin-1 characters, they would get
written back as the UTF-8 encodings of those characters.  Hence you
have the odd situation where you can decode and then encode and end up
with a different byte sequence.

Someone brought up Python in another post.  Python (version 3 at least)
does the same thing when, e.g., interpreting filenames.  If you pass a
string (_not_ bytes) to os.listdir, but the names in the directory
can't all be decoded as UTF-8, it will return strings (_not_ bytes) in
which each undecodable byte is represented by a Python-specific
"character" (a lone surrogate code point in the range U+DC80 to U+DCFF,
via the "surrogateescape" error handler of PEP 383) standing for a "raw
byte", i.e. an entity to be written back to disk as the same raw byte
that was read from it.
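
A small sketch of that Python behaviour, using the standard
"surrogateescape" error handler directly instead of going through
os.listdir (which depends on the filesystem encoding of the machine it
runs on).  The sample bytes are invented for illustration.

```python
# A hypothetical file name: "caf", a stray byte 0xE9, a space, then
# valid UTF-8 for U+03BB.
raw_name = b"caf\xe9 \xce\xbb"

# Undecodable bytes become lone surrogates instead of raising an error.
name = raw_name.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(ch)) for ch in name if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler writes the original bytes back.
restored = name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate, so the "raw byte" cannot
# silently turn into an ordinary character.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(escaped, restored == raw_name, strict_ok)
```

The decode/encode pair is lossless here because the escaped byte keeps
its own identity in the string, distinct from U+00E9.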
