Eli Zaretskii <address@hidden> wrote on Sun, 22 Nov 2015 at 18:35:
> From: Philipp Stephani <address@hidden>
> Date: Sun, 22 Nov 2015 09:25:08 +0000
> Cc: address@hidden, address@hidden, address@hidden
> > Fine with me, but how would we then represent Emacs strings that are not
> > Unicode strings? Just raise an error?
> No need to raise an error. Strings that are returned to modules
> should be encoded into UTF-8. That encoding already takes care of
> these situations: it either produces the UTF-8 encoding of the
> equivalent Unicode characters, or outputs raw bytes.
> Then we should document such a situation and give module authors a way to
> detect them.
I already suggested what we should say in the documentation: that
these interfaces accept and produce UTF-8 encoded non-ASCII text.
If the interface accepts UTF-8, then it must signal an error for invalid sequences; the Unicode standard mandates this.
If the interface produces UTF-8, then it must only ever produce valid sequences; this is again required by the Unicode standard.
> For example, what happens if a sequence of such raw bytes happens
> to be a valid UTF-8 sequence? Is there a way for module code to detect this?
How can you detect that if you are only given the byte stream? You
can't. You need some additional information to be able to distinguish
between these two alternatives.
That's why I propose to not encode raw bytes as bytes, but as the Emacs integer codes used to represent them.
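To make the ambiguity concrete, here is a small Python sketch (Python is used purely for illustration; nothing in it is part of the module API). It assumes the Emacs internal convention that a raw byte B in 0x80..0xFF is represented by the integer code 0x3FFF00 + B; the helper name is hypothetical.

```python
# Illustrative only: raw_byte_code is a hypothetical helper, not an Emacs API.
RAW_BYTE_BASE = 0x3FFF00  # assumed mapping: raw byte B -> code 0x3FFF00 + B

def raw_byte_code(b):
    """Return the integer code standing for raw byte b (0x80..0xFF)."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF need a raw-byte code")
    return RAW_BYTE_BASE + b

# Case 1: the text "é" (U+00E9), a single character.
text_codes = [0xE9]
# Case 2: two raw bytes that happen to equal the UTF-8 encoding of "é".
raw_codes = [raw_byte_code(0xC3), raw_byte_code(0xA9)]

# As a byte stream, both cases look identical ...
text_bytes = "\u00e9".encode("utf-8")
raw_bytes = bytes(c - RAW_BYTE_BASE for c in raw_codes)
same_bytes = text_bytes == raw_bytes

# ... but as sequences of integer codes they are distinguishable.
same_codes = text_codes == raw_codes
```

With only the bytes, a module cannot tell the two cases apart; with the integer codes it can.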
Look, an Emacs module _must_ support non-ASCII text, otherwise it
would be severely limited, to say the least.
Having interfaces that
accept and produce UTF-8 encoded strings is the simplest complete
solution to this problem. So we must at least support that much.
Supporting strings of raw bytes is also possible, probably even
desirable, but it's an extension, something that would be required
much more rarely. Such strings cannot be meaningfully treated as
text: you cannot ask if some byte is an upper-case or lower-case letter,
you cannot display such strings as readable text, you cannot count
characters in it, etc. Such strings are useful for a limited number
of specialized jobs, and handling them in Lisp requires some caution,
because if you treat them as normal text strings, you get surprises.
Yes. However, without an interface they are awkward to produce.
So let's solve the more important issues first, and talk about
extensions later. The more important issue is how a module can pass
non-ASCII text to Emacs and get non-ASCII text back. And the answer
to that is to use UTF-8 encoded strings.
> We are quite capable of quietly accepting such strings, so that is
> what I would suggest. Doing so would be in line with what Emacs does
> when such invalid sequences come from other sources, like files.
> If we accept such strings, then we should document what the extensions are.
> - Are UTF-8-like sequences encoding surrogate code points accepted?
> - Are UTF-8-like sequences encoding integers outside the Unicode codespace accepted?
> - Are non-shortest forms accepted?
> - Are other invalid code unit sequences accepted?
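For reference, each of the four categories in this list is rejected by a strict UTF-8 decoder. The sketch below uses Python's built-in strict decoder as a stand-in; it says nothing about what Emacs itself does with these inputs, and the sample names are only labels.

```python
# Each sample is a byte sequence that well-formed UTF-8 forbids.
SAMPLES = {
    "surrogate code point U+D800": b"\xed\xa0\x80",
    "integer beyond U+10FFFF": b"\xf4\x90\x80\x80",
    "non-shortest form of U+0000": b"\xc0\x80",
    "lone continuation byte": b"\x80",
}

def is_valid_utf8(data):
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # default error handling is "strict"
        return True
    except UnicodeDecodeError:
        return False

# Map each category to whether the strict decoder accepts it.
results = {name: is_valid_utf8(data) for name, data in SAMPLES.items()}
```

An interface that accepts a superset of UTF-8 would have to say, for each of these categories, what it does instead of rejecting.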
_Anything_ can be accepted. _Any_ byte sequence. Emacs will cope.
Not if they accept UTF-8. The Unicode standard rules out accepting invalid byte sequences.
If any byte sequence is accepted, then the behavior becomes more complex. We need to exhaustively describe the behavior for any possible byte sequence, otherwise module authors cannot make any assumption.
After processing, the perpetrator will probably get back a string that
is not entirely human-readable, or its processing will sometimes
produce surprises, like if the string is lower-cased. But nothing bad
will happen to Emacs, it won't crash and won't garble its display.
Moreover, just passing such a string to Emacs, then outputting it back
without any changes will produce an exact copy of the input, which is
quite a feat, considering that the input was "invalid".
If you want to see what "bad" things can happen, take a Latin-1
encoded FILE and visit it with "C-x RET c utf-8 RET C-x C-f FILE RET".
Then play with the buffer a while. This is what happens when Emacs is
told the text is in UTF-8, when it really isn't. There's no
catastrophe, but the luser who does that might be amply punished: at
the very least she will not see the letters she expects. However, if
you save such a buffer to a file, using UTF-8, you will get the same
Latin-1 encoded text as was there originally.
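The round trip described here can be imitated outside Emacs. The sketch below uses Python's `surrogateescape` error handler as an analogy: it is a different mechanism from Emacs's raw-byte characters, but it has the same property that undecodable bytes are carried through and written back unchanged.

```python
# Latin-1 encoded text: b"caf\xe9" is not valid UTF-8.
latin1_bytes = "caf\u00e9".encode("latin-1")

# Decode "as UTF-8" without raising: the stray byte 0xE9 is carried
# through as the lone surrogate U+DCE9.
decoded = latin1_bytes.decode("utf-8", errors="surrogateescape")

# The reader does not see the letter she expects ...
shows_expected_letter = decoded == "caf\u00e9"

# ... but encoding again restores the original bytes exactly.
round_tripped = decoded.encode("utf-8", errors="surrogateescape")
```

Here `round_tripped` equals `latin1_bytes`, while `decoded` is not the text the user had in mind.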
Now, given such resilience, why do we need to raise an error?
The Unicode standard says so. If we document that *a superset of UTF-8* is accepted, then we don't need to raise an error. So I'd suggest we do exactly that, but describe what that superset is.
> If the answer to any of these is "yes", we can't say we accept UTF-8, because
> we don't.
We _expect_ UTF-8, and if given that, will produce known, predictable
results when the string is processed as text. We can _tolerate_
violations, resulting in somewhat surprising behavior, if such a text
is treated as "normal" human-readable text. (If the module knows what
it does, and really means to work with raw bytes, then Emacs will do
what the module expects, and produce raw bytes on output, as expected.)
No matter what we expect or tolerate, we need to state that. If all byte sequences are accepted, then we also need to state that, but describe what the behavior is if there are invalid UTF-8 sequences in the input.
> Rather we should say what is actually accepted.
Saying that is meaningless in this case, because we can accept
anything. _If_ the module wants the string it passes to be processed
as human-readable text that consists of recognizable characters, then
the module should _only_ pass valid UTF-8 sequences. But raising
errors upon detecting violations was discovered long ago to be a bad idea
that users resented. So we don't, and neither should the module API.
Module authors are not end users. I agree that end users should not see errors on decoding failure, but modules use only programmatic access, where we can be more strict.
> > * If copy_string_contents is passed an Emacs string that is not a valid
> > Unicode string, what should happen?
> How can that happen? The Emacs string comes from the Emacs bowels, so
> it must be "valid" string by Emacs standards. Or maybe I don't
> understand what you mean by "invalid Unicode string".
> A sequence of integers where at least one element is not a Unicode scalar value.
Emacs doesn't store characters as scalar Unicode values, so this
doesn't really explain to me your concept of a "valid Unicode string".
An Emacs string is a sequence of integers. It doesn't have to be a sequence of scalar values.
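The distinction can be written as a predicate. A Unicode scalar value is any code point in 0..0x10FFFF except the surrogates 0xD800..0xDFFF, whereas Emacs character codes are assumed here to extend up to 0x3FFFFF. A minimal sketch with hypothetical function names:

```python
MAX_UNICODE = 0x10FFFF
MAX_EMACS_CHAR = 0x3FFFFF  # assumed upper bound of Emacs character codes

def is_scalar_value(code):
    """True for Unicode scalar values: code points minus surrogates."""
    return 0 <= code <= MAX_UNICODE and not 0xD800 <= code <= 0xDFFF

def is_emacs_char(code):
    """True for any integer in the assumed Emacs character range."""
    return 0 <= code <= MAX_EMACS_CHAR

def is_valid_unicode_string(codes):
    """True if every element of the sequence is a Unicode scalar value."""
    return all(is_scalar_value(c) for c in codes)

examples = {
    "plain text": [0x61, 0xE9, 0x1F600],
    "lone surrogate": [0x61, 0xD800],
    "raw byte 0xFF": [0x61, 0x3FFFFF],
}
# All three are sequences of Emacs characters; only the first is a
# valid Unicode string in the sense discussed here.
verdicts = {name: is_valid_unicode_string(c) for name, c in examples.items()}
```

The question for copy_string_contents is what it should do with the second and third kinds of sequence.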
> In any case, we already deal with any such problems when we save a
> buffer to a file, or send it over the network. This isn't some new
> problem we need to cope with.
> Yes, but the module interface is new, it doesn't necessarily have to have the
> same behavior.
Of course, it does! Modules are Emacs extensions, so the interface
should support the same features that core Emacs does. Why? Because
there are no limits to what creative minds can do with this feature, so
we should not artificially impose such limitations where we have
sound, time-proven infrastructure that doesn't need them.
I agree that we shouldn't add such limitations. But I disagree that we should leave the behavior undocumented in such cases.
> If we say we emit only UTF-8, then we should do so.
We emit only valid UTF-8, provided that its source (if it came from a
module) was valid UTF-8.
Then in turn we shouldn't say we emit only UTF-8.