Re: Unibyte characters, strings, and buffers

From: David Kastrup
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 11:42:43 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

"Stephen J. Turnbull" <address@hidden> writes:

> Eli Zaretskii writes:
>  > How is it different?  What would be the encoding of a buffer that
>  > contains raw bytes?
> Depends.  If it's uninterpreted bytes, "binary."  If those are
> undecodable bytes, they'll be the representation of raw bytes that
> occurred in an otherwise sane encoded stream, and the buffer's
> encoding will be the nominal encoding of that stream.

It's worth pointing out that there is no such thing as a "buffer's
encoding" in general in Emacs.  Buffers are sequences of characters or,
in the case of a unibyte buffer, bytes.  Encodings come into play only
for import/export.  They are not an inherent property of the buffer as
such but rather, for example, of the buffer's file association.

Emacs has two kinds of internal representation (what one might actually
want to call "buffer encoding"): unibyte and multibyte.  XEmacs, I
think, has only one.

The current point of contention is whether codepoint-based character
operations should behave differently depending on the unibyte state of
the current buffer.

I consider that an astonishingly bad idea since character and string
operations are not tied to a particular buffer.  The whole point of MULE
from rather early on was to deal with only a single Unicode-based
character set in all of Emacs.  Making character
operations change meaning based on a buffer's unibyte status means a
return to the character set semantics of Emacs 19.

I am not necessarily of the same opinion as Stephen regarding whether or
not abolishing unibyte buffers is a worthwhile goal.  But I am pretty
sure that "unibyte" should not be bleeding over into character and
string operations.

A unibyte buffer or unibyte string might error out when trying to insert
characters out of the range 0..255.  That's an obvious consequence of
the buffer's representation.

If we want different semantics for case-fold-search in binary buffers,
then the solution is setting a buffer-local setting of case-fold-search
when opening a buffer intended to be manipulated in a binary way.
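
A minimal sketch of what such a setting could look like, assuming the
file is visited through find-file-noselect with a non-nil RAWFILE
argument.  The function name my-find-file-binary is made up for
illustration and is not an existing Emacs command:

```elisp
;; Sketch only: `my-find-file-binary' is a hypothetical name, not part
;; of Emacs.
(defun my-find-file-binary (file)
  "Visit FILE as raw bytes and make searches in its buffer case-sensitive."
  (interactive "fFind file (binary): ")
  ;; A non-nil RAWFILE argument visits the file without decoding it.
  (let ((buffer (find-file-noselect file nil t)))
    (with-current-buffer buffer
      ;; Buffer-local: searches in every other buffer are unaffected.
      (setq-local case-fold-search nil))
    (switch-to-buffer buffer)))
```

The point of the sketch is that the code choosing binary treatment
sets case-fold-search itself, rather than the search code deriving
case folding from the buffer's representation.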

But the unibyte setting of the buffer should not affect normal character
and string operation semantics.  It is a buffer implementation detail
that should not really have a visible effect apart from making some
buffer operations impossible.

Whether or not we want to abolish unibyte buffer representations, we
don't want this to bleed effects beyond the buffer representation.

If something chooses a unibyte buffer representation for some reason, it
is the responsibility of the same something to switch character
operations, case-fold-search, etc. to something making sense in the
context of its operation.  That may well be through some buffer-local
setting of case-fold-search etc., but it is not tied to the internal
representation of the buffer contents.

David Kastrup
