Eli Zaretskii <address@hidden> wrote on Sun, 22 Nov 2015 at 19:04:
> > From: Philipp Stephani <address@hidden>
> > Date: Sun, 22 Nov 2015 14:56:12 +0000
> > Cc: address@hidden, address@hidden, address@hidden
> > - The multibyte API should use an extension of UTF-8 to encode Emacs strings.
> >   The extension is the obvious one already in use in multiple places.
> It is only used in one place: the internal representation of
> characters in buffers and strings. Emacs _never_ lets this internal
> representation leak outside.
If I run in scratch:
Then the resulting help buffer says "buffer code: #xF8 #x8F #xBF #xBD
#x80". Is that not considered a leak?
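
For reference, a minimal sketch that reproduces this output (assuming
the character in question is #x3fff40, the same one used in the
call-process example below):

;; Insert a codepoint above the Unicode range and ask describe-char for
;; its internal representation; the "buffer code" line in the resulting
;; *Help* buffer shows the raw internal bytes.
(with-temp-buffer
  (insert #x3fff40)
  (describe-char (point-min)))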
> In practice the last sentence means that
> text that Emacs encoded in UTF-8 will only include either valid UTF-8
> sequences of characters whose codepoints are below #x200000 or single
> bytes that don't belong to any UTF-8 sequence.
I get the same result as above when running

(call-process "echo" nil t nil (string #x3fff40))

which means that the non-UTF-8 sequence is even "leaked" to the
external process!
> You are suggesting to expose the internal representation to outside
> application code, which predictably will cause that representation to
> leak into Lisp. That'd be a disaster. We had something like that
> back in the Emacs 20 era, and it took many years to plug those leaks.
> We would be making a grave mistake to go back there.
I don't suggest leaking anything that isn't already leaked. The
extension of the codespace to 22 bits is well documented.
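
A quick sketch of that 22-bit codespace, for reference:

;; The documented Emacs codespace is 22 bits wide: characters run from
;; 0 up to (max-char), i.e. #x3FFFFF.
(= (max-char) (1- (expt 2 22)))   ; t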
> What you suggest is also impossible without deep changes in how we
> decode and encode text: that process maps codepoints above #x1FFFFF to
> either codepoints below that mark or to raw bytes. So it's impossible
> to produce these high codes in UTF-8 compatible form while handling
> UTF-8 text. To say nothing of the simple fact that no library
> function in any C library will ever be able to do anything useful with
> such codepoints, because they are our own invention.
Unless the behavior changed recently, that doesn't seem to be the case:

(encode-coding-string (string #x3fff40) 'utf-8-unix)

Or are you talking about something different?
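
One way to inspect what the encoder actually produces here (a sketch;
the byte values are what I would expect if the extended 5-byte form is
emitted):

;; List the individual bytes of the encoded string; if the encoder
;; emits the extended 5-byte form for #x3fff40, the bytes should be
;; #xF8 #x8F #xBF #xBD #x80 rather than anything within plain UTF-8.
(append (encode-coding-string (string #x3fff40) 'utf-8-unix) nil)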
> > - There should be a one-to-one mapping between Emacs multibyte strings and
> >   encoded module API strings.
> UTF-8 encoded strings satisfy that requirement.
No! UTF-8 can only encode Unicode scalar values. Only the Emacs
extension to UTF-8 (which, I think, Emacs unfortunately also calls
"UTF-8") satisfies this. If you are talking about that extension, then
we are talking about the same thing anyway.
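
To make the distinction concrete, a small sketch of a perfectly valid
Emacs multibyte string that plain UTF-8 cannot represent:

;; The string's only character lies outside the Unicode codespace
;; (max scalar value #x10FFFF); standard UTF-8 has no encoding for it,
;; only the Emacs extension does.
(let ((s (string #x3fff40)))
  (list (multibyte-string-p s)     ; t
        (> (aref s 0) #x10FFFF)))  ; t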
> > Therefore non-shortest forms, illegal code unit sequences, and code
> > unit sequences that would encode values outside the range of Emacs
> > characters are illegal and raise a signal.
> Once again, this was tried in the past and was found to be a bad idea.
> Emacs provides features to test the result of converting invalid
> sequences, for the purposes of detecting such problems, but it leaves
> that to the application.
It's probably OK to accept invalid sequences, for consistency with
decode-coding-string and friends. I don't really like it, though: the
module API, like decode-coding-string, is not a general-purpose UI for
end users, and accepting invalid sequences is error-prone and can even
introduce security issues (see e.g.
https://blogs.oracle.com/CoreJavaTechTips/entry/the_overhaul_of_java_utf).
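
For illustration, this is the kind of lenient behavior I mean (a
sketch; I would expect the stray byte to come back as a raw-byte
character rather than an error):

;; #xC3 starts a two-byte UTF-8 sequence, but #x28 is not a valid
;; continuation byte, so the input is invalid; decoding does not
;; signal, it silently turns the stray #xC3 into a raw-byte character.
(decode-coding-string (unibyte-string #xC3 #x28) 'utf-8-unix)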
> > Likewise, such sequences will never be returned from Emacs.
> Emacs doesn't return invalid sequences, if the original text didn't
> include raw bytes. If there were raw bytes in the original text,
> Emacs has no choice but to return them, or else it will violate a
> basic expectation from a text-processing program: that it shall never
> change the portions of text that were not affected by the processing.
It seems that Emacs does return invalid sequences for characters such
as #x3fff40 (or anything else outside of Unicode except the 128 values
used for encoding raw bytes).
Returning raw bytes means that encoding and decoding isn't a perfect
round trip:
(decode-coding-string (encode-coding-string (string #x3fffc2 #x3fffbb) 'utf-8-unix) 'utf-8-unix)
We might be able to live with that as it's an extreme edge case.
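
To spell out why the round trip fails here (a sketch of what I would
expect): the two raw-byte characters stand for the bytes #xC2 #xBB,
which together happen to form valid UTF-8 for "»", so decoding yields
one Unicode character instead of the two original raw bytes.

;; Compare the original string with the result of the round trip; the
;; two raw bytes presumably collapse into the single character "»".
(let* ((orig (string #x3fffc2 #x3fffbb))
       (back (decode-coding-string
              (encode-coding-string orig 'utf-8-unix) 'utf-8-unix)))
  (list (length orig) (length back) (string= orig back)))
;; expected: (2 1 nil)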
> > I think this is a relatively simple and unsurprising approach. It allows
> > encoding the documented Emacs character space while still being fully
> > compatible with UTF-8 and not resorting to undocumented Emacs internals.
> So does the approach I suggested. The advantage of my suggestion is
> that it follows a long Emacs tradition about every aspect of encoding
and decoding text, and doesn't require any changes in the existing
What are the exact differences between the approaches? As far as I can
see, differences exist only for the following points:
- Accepting invalid sequences. I consider that a bug in general-purpose
  APIs, including decode-coding-string. However, given that Emacs
  already extends the Unicode codespace and therefore has to accept
  some invalid sequences anyway, it might be OK if it's clearly
  documented.
- Emitting raw bytes instead of extended sequences. Though I'm not a
  fan of this, it might be unavoidable in order to treat strings
  transparently (which is desirable); see the sketch below.
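
A sketch of the raw-byte behavior from the second point (assuming the
usual mapping of raw bytes to the characters #x3fff80..#x3fffff):

;; Raw-byte characters encode back to the single bytes they stand for,
;; not to 5-byte extended sequences, so undecodable bytes pass through
;; an encode/decode cycle unchanged.
(encode-coding-string (string #x3fff80 #x3fffff) 'utf-8-unix)
;; expected: a unibyte string containing just the bytes #x80 and #xFF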