Re: Design decision of string in Emacs

From: Stefan Monnier
Subject: Re: Design decision of string in Emacs
Date: Wed, 16 Dec 2020 09:56:28 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)

> ```
> (string-bytes (concat (symbol-name 'GET) (encode-coding-string "我" 'utf-8)))
> ;; => 9
> (string-bytes (concat (symbol-name 'GET) (encode-coding-string "foo" 'utf-8)))
> ;; => 6
> (string-bytes (concat "GET" (encode-coding-string "我" 'utf-8)))
> ;; => 6
> ```

Oh, you're looking at the ugly mess we still have under the carpet, huh?

[ Based on the rest of what you wrote I gather that you did figure out
  what's going on: congratulations!  ]

> 1. Why Emacs use same type to represent both bytes and string? Putting
> them in different type(if we have a time-machine) may be much clearer
> and avoid some confusion

Emacs started with 8-bit characters, so there was no good reason to
distinguish sequences of bytes from sequences of characters.
When support for larger character sets was introduced (in MULE), the
need to work with existing ELisp code made it necessary to be very
permissive about confusion between chars and bytes.

This led to introducing 2 types (unibyte and multibyte strings) but
pretending as hard as possible that it's still just a single type.
Also when MULE was merged into the official version of Emacs, the
original focus was on avoiding regressions, so it was important
to automatically treat bytes as "iso-8859-1 chars" and vice versa, like
the old Emacs used to do.
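
To make the two kinds of strings concrete, here is a minimal sketch.
The helper `my-string-kind` is a made-up name for illustration only; the
functions it calls (`multibyte-string-p`, `length`, `string-bytes`) are
standard.  The string is written as "\u6211" (the character 我) so that
it does not depend on how the source file is decoded.

```elisp
;; Hypothetical helper, named here only for illustration.
(defun my-string-kind (s)
  "Return a list (KIND CHARS BYTES) describing string S."
  (list (if (multibyte-string-p s) 'multibyte 'unibyte)
        (length s)          ; number of elements (chars or bytes)
        (string-bytes s)))  ; size of the internal representation

(my-string-kind "\u6211")                               ; 我 as characters
;; => (multibyte 1 3)
(my-string-kind (encode-coding-string "\u6211" 'utf-8)) ; 我 as UTF-8 bytes
;; => (unibyte 3 3)
```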

Over time, we have made the distinction a bit stronger, introducing
a few more checks and signaling a few more errors, but we're still very
much in the "DWIM" world.  A big reason for that is that there's no
distinction (in the printed representation) between unibyte and
multibyte for strings which only contain ASCII.
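
A small sketch of why this matters, and of what is probably going on in
your example: two ASCII strings that print identically can still give
different results once they are concatenated with non-ASCII bytes.  The
function name `my-ascii-concat-demo` is hypothetical; the sketch uses
`string-to-unibyte` and `string-to-multibyte` to pin down the kind of
each "GET" explicitly rather than relying on what the reader or
`symbol-name` happens to return.

```elisp
;; Hypothetical demo function, for illustration only.
(defun my-ascii-concat-demo ()
  "Show that identical-looking ASCII strings can concat differently."
  (let ((uni   (string-to-unibyte "GET"))      ; ASCII, unibyte
        (multi (string-to-multibyte "GET"))    ; ASCII, multibyte
        (body  (encode-coding-string "\u6211" 'utf-8))) ; 3 bytes, unibyte
    (list (prin1-to-string uni)
          (prin1-to-string multi)
          ;; unibyte + unibyte stays unibyte: 3 + 3 bytes.
          (string-bytes (concat uni body))
          ;; multibyte + unibyte gives multibyte: each non-ASCII byte
          ;; becomes an eight-bit "character" that takes 2 bytes internally.
          (string-bytes (concat multi body)))))

(my-ascii-concat-demo)
;; => ("\"GET\"" "\"GET\"" 6 9)
```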

In my local/personal Emacs branch, I tried to improve this (to try and
avoid the kind of inconsistency you show in your example above, for
example) by treating "ASCII strings" specially, considering them to be
both unibyte and multibyte at the same time.  It kinda works, but it's
not clear it's a sufficient improvement to justify the (minor) backward
incompatibility it introduces.

> 2. Why Emacs extend Unicode charset to hold single eight-bit?
> I don't know if there's any pratical use.

Ah, that question is much simpler: when reading a file labeled as using
utf-8 bytes, we need to handle the case where the content is actually
not valid utf-8.  We could just signal an error and refuse to read the
file, but we decided instead to make it possible to read and edit such
files by representing (in the buffer) the invalid byte sequences using
those special "eight-bit byte characters".  This way, you can edit
a "mostly utf-8 file with some invalid byte sequences" just fine and
those invalid byte sequences will be properly preserved when you save
the file.

Of course, the same mechanism is also used for encodings other than utf-8.
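
As a rough sketch of that round trip, using a string instead of a file:
the byte #xff can never appear in valid utf-8, so decoding keeps it as
one of those eight-bit characters, and encoding again gives back the
original bytes.  The function name `my-raw-byte-demo` is made up for this
example; `unibyte-string`, `decode-coding-string`,
`encode-coding-string` and `char-charset` are the standard functions.

```elisp
;; Hypothetical demo function, for illustration only.
(defun my-raw-byte-demo ()
  "Decode invalid UTF-8 and check that the bad byte survives re-encoding."
  (let* ((bytes   (unibyte-string ?a #xff ?b))   ; #xff is invalid in UTF-8
         (decoded (decode-coding-string bytes 'utf-8)))
    (list (length decoded)                       ; still 3 elements
          (char-charset (aref decoded 1))        ; charset of the odd one
          ;; Encoding the decoded text restores the original bytes.
          (equal (encode-coding-string decoded 'utf-8) bytes))))

(my-raw-byte-demo)
;; => (3 eight-bit t)
```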

> 3. Is there any existing best pratice in manipulating strings and bytes?
> If there's none. We may discuss and record it to Elisp manual.

Not really.  My own (very general) recommendation is to try and remember
that unibyte strings are sequences of bytes while multibyte strings are
sequences of characters and to try and keep it very clear in your head
when you're manipulating bytes and when you're manipulating characters.
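
One way to follow that advice, sketched under the assumption that you're
building something like the request from your example: do all the
concatenation on characters, then encode exactly once at the point where
the data leaves Emacs.  `my-request-bytes` is a hypothetical helper, and
it deliberately ignores things a real HTTP client needs (such as
percent-encoding the path).

```elisp
;; Hypothetical helper, for illustration only.
(defun my-request-bytes (path)
  "Return the bytes of a minimal request for PATH.
PATH is a string of characters; the result is a unibyte string."
  ;; Build the text from characters first, then encode once.
  ;; `utf-8-unix' keeps the explicit \r\n sequences untouched.
  (encode-coding-string (concat "GET " path " HTTP/1.1\r\n\r\n")
                        'utf-8-unix))

(multibyte-string-p (my-request-bytes "/\u6211"))
;; => nil
(length (my-request-bytes "/\u6211"))
;; => 21
```

Since the result is unibyte, `length` counts bytes here, which is what
you want for things like a Content-Length header.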

