Eli Zaretskii <address@hidden> wrote on Sun, 22 Nov 2015 at 20:20:
> From: Philipp Stephani <address@hidden>
> Date: Sun, 22 Nov 2015 18:19:29 +0000
> Cc: address@hidden, address@hidden, address@hidden
> I already suggested what we should say in the documentation: that
> these interfaces accept and produce UTF-8 encoded non-ASCII text.
> If the interface accepts UTF-8, then it must signal an error for invalid
> sequences; the Unicode standard mandates this.
The Unicode standard cannot mandate anything for Emacs, because Emacs
is not subject to Unicode standardization.
True, but I think we shouldn't make the terminology more confusing. If we say "UTF-8", we should mean "UTF-8 as defined in the Unicode standard", not the Emacs extension of UTF-8. That's all.
> If the interface produces UTF-8, then it must only ever produce valid
As I explained, this would violate the basic expectation from a text
> That's why I propose to not encode raw bytes as bytes, but as the Emacs integer
> codes used to represent them.
If we do that, no external code will be able to do anything useful
with such "bytes". Module authors will have to write their own
replacements for library functions. This will never be accepted by
I wouldn't be so pessimistic, but I was convinced by consistency with encode-coding-string. So yes, let's use the raw bytes (and document that).
> If any byte sequence is accepted, then the behavior becomes more complex. We
> need to exhaustively describe the behavior for any possible byte sequence,
> otherwise module authors cannot make any assumption.
We say that we accept valid UTF-8 encoded strings; anything else
might produce invalid UTF-8 on output.
Couldn't we just say "it behaves as if encoding and decoding were done using the utf-8-unix coding system"? Because I think that's what this boils down to.
> No matter what we expect or tolerate, we need to state that.
No, we don't. When the callers violate the contract, they cannot
expect to know in detail what will happen. If they want to know, they
will have to read the source.
So you want this to be unspecified or undefined behavior? That might be OK (we already have that in several places), but we still need to state what the contract is.
> Module authors are not end users.
They are users like anyone who writes Lisp. They came to expect that
Emacs behaves in certain ways, and modules should follow suit.
> I agree that end users should not see errors on decoding failure,
> but modules use only programmatic access, where we can be more
You cannot be more strict, unless you rewrite the whole
encoding/decoding machinery, or write specialized code to detect and
reject invalid UTF-8 before it is passed to a decoder. There are no
good reasons to do either, so let's not.
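For illustration only, the "specialized code to detect and reject invalid UTF-8" mentioned above could look roughly like the following C sketch. The function name utf8_is_valid is hypothetical; it is not part of Emacs or of the module API. The sketch assumes "valid" means well-formed UTF-8 as defined by the Unicode standard: no overlong forms, no surrogates, nothing above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of Emacs or emacs-module.h.
   Return true if the LEN bytes at S are well-formed UTF-8 in the
   Unicode sense: no overlong forms, no surrogate code points, and
   no code points above U+10FFFF.  */
static bool
utf8_is_valid (const unsigned char *s, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      unsigned char b = s[i];
      size_t need;        /* continuation bytes expected after B */
      unsigned long cp;   /* code point being assembled */
      unsigned long min;  /* smallest code point legal at this length */

      if (b < 0x80)
        {
          i++;            /* plain ASCII */
          continue;
        }
      else if (b >= 0xC2 && b <= 0xDF)
        { need = 1; cp = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0)
        { need = 2; cp = b & 0x0F; min = 0x800; }
      else if (b >= 0xF0 && b <= 0xF4)
        { need = 3; cp = b & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation, 0xC0, 0xC1, or 0xF5..0xFF */

      if (len - i - 1 < need)
        return false;     /* sequence truncated at end of input */

      for (size_t k = 1; k <= need; k++)
        {
          unsigned char c = s[i + k];
          if ((c & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (c & 0x3F);
        }

      if (cp < min)
        return false;     /* overlong encoding */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;     /* UTF-16 surrogate */
      if (cp > 0x10FFFF)
        return false;     /* beyond the Unicode range */

      i += need + 1;
    }
  return true;
}
```

A caller would run this over the bytes before handing them to a decoder, for example utf8_is_valid ((const unsigned char *) buf, n). This is exactly the kind of extra pass the paragraph above argues against adding.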
> An Emacs string is a sequence of integers.
No, it's a sequence of bytes.
"In Emacs Lisp, characters are simply integers ... A string is a fixed sequence of characters"
How a string is represented internally shouldn't be the concern of module authors.
> I agree that we shouldn't add such limitations. But I disagree that we should
> leave the behavior undocumented in such cases.
OK, so let's agree to disagree. If that disagreement gets in your way
of fixing the issues related to this discussion, please say so, and I
will fix them myself.
No, I will definitely fix it. I think our disagreement is much smaller than it might look.