emacs-devel

Design decision of string in Emacs


From: Zhu Zihao
Subject: Design decision of string in Emacs
Date: Wed, 16 Dec 2020 21:12:41 +0800
User-agent: mu4e 1.4.13; emacs 27.1

Recently I was browsing the Emacs China forum and came across a strange
question [1]:

```
(string-bytes (concat (symbol-name 'GET) (encode-coding-string "我" 'utf-8)))
;; => 9

(string-bytes (concat (symbol-name 'GET) (encode-coding-string "foo" 'utf-8)))
;; => 6

(string-bytes (concat "GET" (encode-coding-string "我" 'utf-8)))
;; => 6
```

When a string returned by `symbol-name` is concatenated with encoded CJK
characters, the result contains more bytes than expected.

Curiosity drove me to do some research on this. I read the manual and
some source code (mule-conf.el, lread.c) and ran a few experiments of my
own.

My conclusions are:

1. When a unibyte string is concatenated with a multibyte string, Emacs
converts each non-ASCII byte to an eight-bit char in the range
#x3FFF80..#x3FFFFF. Each such char takes two bytes in the internal
representation, so the first example yields 3 + 3 * 2 = 9 bytes.

2. `symbol-name` returns a multibyte string, because a symbol name should
always be text ("multibyte string") rather than bytes. So even when the
name contains only ASCII characters, Emacs marks it as a multibyte
string.

3. A string constructed by the reader is first assumed to be unibyte;
only if the reader encounters a multibyte char is it marked as
multibyte. That's why (string-bytes (concat "GET"
(encode-coding-string "我" 'utf-8))) returns 6: Emacs treats it as a
concatenation of two unibyte strings.
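The three conclusions can be checked with a small sketch like the one
below. `demo-string-info` and `demo-encoded` are hypothetical names made
up for this illustration, and `string-to-multibyte` is used as a stand-in
for the multibyte string that `symbol-name` returned in the report, so
the result does not depend on how the symbol was first read.

```elisp
;; Hypothetical helper (not part of Emacs): summarize how a string is stored.
(defun demo-string-info (s)
  "Return (MULTIBYTE-P LENGTH BYTES) for string S."
  (list (multibyte-string-p s) (length s) (string-bytes s)))

(defvar demo-encoded (encode-coding-string "我" 'utf-8)
  "Unibyte string holding the three UTF-8 bytes E6 88 91.")

;; encode-coding-string returns a unibyte string.
(demo-string-info demo-encoded)                 ; => (nil 3 3)

;; An ASCII-only literal is read as unibyte (conclusion 3),
;; so this concat stays unibyte: 3 + 3 bytes.
(demo-string-info (concat "GET" demo-encoded))  ; => (nil 6 6)

;; Stand-in for the multibyte result of symbol-name (conclusion 2):
;; the three raw bytes become eight-bit chars of two bytes each.
(demo-string-info (concat (string-to-multibyte "GET") demo-encoded))
;; => (t 6 9)

;; The trailing three characters fall in #x3FFF80..#x3FFFFF (conclusion 1).
(mapcar (lambda (c) (format "#x%X" c))
        (substring (concat (string-to-multibyte "GET") demo-encoded) 3))
;; => ("#x3FFFE6" "#x3FFF88" "#x3FFF91")
```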

IMO, a multibyte string in Emacs is like a "string", while a unibyte
string is like a vector of u8 numbers.

In some languages, bytes and strings are different types and cannot be
concatenated without an explicit conversion, and an attempt to convert
invalid bytes to a string throws an error. Emacs instead extends the
Unicode charset to tolerate these malformed bytes.
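For comparison, the explicit conversion that such languages require can
also be written out by hand in Emacs Lisp. The following is only a
sketch, assuming UTF-8 is the intended byte encoding: encoding the text
part first makes both operands unibyte, so no eight-bit chars are
introduced.

```elisp
;; Encode the symbol name explicitly so that both operands are bytes.
;; Assumption for this sketch: UTF-8 is the desired encoding.
(let* ((method  (encode-coding-string (symbol-name 'GET) 'utf-8))
       (request (concat method (encode-coding-string "我" 'utf-8))))
  (list (multibyte-string-p request) (string-bytes request)))
;; => (nil 6)
```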

I'm interested in the following points.

1. Why does Emacs use the same type to represent both bytes and strings?
Putting them in different types (if we had a time machine) might be much
clearer and avoid some confusion.

2. Why does Emacs extend the Unicode charset to hold single eight-bit
bytes? I don't know whether there is any practical use for this.

3. Is there any existing best practice for manipulating strings and
bytes? If there is none, we could discuss it and record it in the Elisp
manual.


[1]: 
https://emacs-china.org/t/concat-symbol-name-get-encode-coding-string-utf-8-bytes/15350

-- 
Retrieve my PGP public key:

  gpg --recv-keys D47A9C8B2AE3905B563D9135BE42B352A9F6821F

Zihao
