Eli Zaretskii <address@hidden> wrote on Sat, 21 Nov 2015 at 14:23:
>> From: Philipp Stephani <address@hidden>
>> Date: Sat, 21 Nov 2015 12:11:45 +0000
>> Cc: address@hidden, address@hidden, address@hidden
>>> No, we cannot, or rather should not.  It is unreasonable to expect
>>> external modules to know the intricacies of the internal
>>> representation.  Most Emacs hackers don't.
>> Fine with me, but how would we then represent Emacs strings that are
>> not valid Unicode strings?  Just raise an error?
> No need to raise an error.  Strings that are returned to modules
> should be encoded into UTF-8.  That encoding already takes care of
> these situations: it either produces the UTF-8 encoding of the
> equivalent Unicode characters, or outputs raw bytes.
Then we should document such situations and give module authors a way
to detect them.  For example, what happens if a sequence of such raw
bytes happens to form a valid UTF-8 sequence?  Is there a way for
module code to detect that?
> We are using this all the time when we save files or send stuff over
> the network.
>>> No, I meant strict UTF-8, not its Emacs extension.
>> That would be possible and provide a clean interface.  However, Emacs
>> strings are extended, so we'd need to specify how they interact with
>> UTF-8 strings.
>> * If a module passes a char sequence that's not a valid UTF-8 string,
>>   but a valid Emacs multibyte string, what should happen?  Error,
>>   undefined behavior, silently accepted?
> We are quite capable of quietly accepting such strings, so that is
> what I would suggest.  Doing so would be in line with what Emacs does
> when such invalid sequences come from other sources, like files.
If we accept such strings, then we should document what the extensions are.
- Are UTF-8-like sequences encoding surrogate code points accepted?
- Are UTF-8-like sequences encoding integers outside the Unicode codespace accepted?
- Are non-shortest forms accepted?
- Are other invalid code unit sequences accepted?
If the answer to any of these is "yes", we can't say we accept UTF-8, because we don't. Rather we should say what is actually accepted.
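For concreteness, here is what a strict check would reject.  This is an
illustrative helper written for this discussion, not a function that
exists in Emacs; it implements "strict UTF-8" in the RFC 3629 sense:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative only -- not an Emacs function.  Return true if the LEN
   bytes at BUF are strict UTF-8 per RFC 3629: no surrogates, no
   non-shortest forms, no code points outside the Unicode codespace,
   and no stray or truncated sequences.  */
bool
valid_strict_utf8 (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char b = buf[i];
      unsigned long cp;
      size_t n;
      if (b < 0x80)                { cp = b;        n = 1; }
      else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; n = 2; }
      else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; n = 3; }
      else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; n = 4; }
      else return false;                    /* invalid lead byte */
      if (i + n > len) return false;        /* truncated sequence */
      for (size_t j = 1; j < n; j++)
        {
          if ((buf[i + j] & 0xC0) != 0x80)  /* bad continuation byte */
            return false;
          cp = (cp << 6) | (buf[i + j] & 0x3F);
        }
      if (n == 2 && cp < 0x80) return false;     /* non-shortest form */
      if (n == 3 && cp < 0x800) return false;
      if (n == 4 && cp < 0x10000) return false;
      if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
      if (cp > 0x10FFFF) return false;     /* outside the codespace */
      i += n;
    }
  return true;
}
```

For example, "\xED\xA0\x80" (a surrogate encoded UTF-8-style) and
"\xC0\xAF" (a non-shortest form of '/') both fail this check; whatever
make_multibyte_string does accept beyond it would be exactly the
extension that needs documenting.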
>> * If copy_string_contents is passed an Emacs string that is not a
>>   valid Unicode string, what should happen?
> How can that happen?  The Emacs string comes from the Emacs bowels, so
> it must be a "valid" string by Emacs standards.  Or maybe I don't
> understand what you mean by "invalid Unicode string".
A sequence of integers where at least one element is not a Unicode scalar value.
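Spelled out (illustration only, not Emacs code): a Unicode scalar value
is any code point in 0..U+10FFFF excluding the surrogate range
U+D800..U+DFFF, so a check over a sequence of code points could look
like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helpers, not Emacs functions.  */
bool
is_scalar_value (uint32_t cp)
{
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* A "valid Unicode string" in the sense above: every element of the
   code point sequence S is a Unicode scalar value.  */
bool
valid_unicode_string (const uint32_t *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (!is_scalar_value (s[i]))
      return false;
  return true;
}
```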
> In any case, we already deal with any such problems when we save a
> buffer to a file, or send it over the network.  This isn't some new
> problem we need to cope with.
Yes, but the module interface is new; it doesn't necessarily have to
have the same behavior.  If we say we emit only UTF-8, then we should
do so.
>> OK, then we can use that, of course.  The question of handling
>> invalid UTF-8 strings is still open, though, as
>> make_multibyte_string doesn't enforce valid UTF-8.
> It doesn't enforce valid UTF-8 because it can handle invalid UTF-8 as
> well.  That's by design.
Then whatever it handles needs to be specified.
>> If it's the contract of make_multibyte_string that it will always
>> accept UTF-8, then that should be added as a comment to that
>> function.  Currently I don't see it documented anywhere.
> That part of the documentation is only revealed to veteran Emacs
> hackers, subject to swearing not to reveal it to the uninitiated and
> to some blood-letting that seals the oath ;-)
I see ;-)