Eli Zaretskii <address@hidden> wrote on Sat, Nov 21, 2015 at 12:10:
> (Btw, I don't think we should worry about changing the internal
> representation of characters in Emacs, because make_multibyte_string
> will be updated as needed.)
> This is a crucial point. If the internal encoding never changes, then we can
> declare that those string parameters are expected to be in the internal
No, we cannot, or rather should not. It is unreasonable to expect
external modules to know the intricacies of the internal
representation. Most Emacs hackers don't.
Fine with me, but how would we then represent Emacs strings that are not valid Unicode strings? Just raise an error?
> But see the discussion in
> https://github.com/aaptel/emacs-dynamic-module/issues/37: the comment in
> mule-conf.el seems to indicate that the internal encoding is not stable.
That discussion is about zero-copy access to Emacs buffer text and
Emacs strings inside module code.
Partially. The encoding discussion is also part of it, because the encoding has to be specified before zero-copy access is even possible.
Such access is indeed impossible
without either knowing _something_ about the internal representation,
or having additional APIs in emacs-module.c that allow modules such
access while hiding the details of the internal representation. We
could discuss extending the module functionality to include this.
Yes, there's no need for that in this subthread though.
But that is a separate issue from what module_make_function and
module_make_string do. These two functions are basic, and don't need
to know about the internal representation or use it. While direct
access to Emacs buffer text will be needed by only some modules,
module_make_function will be used by all of them, and
module_make_string by many.
So I think we shouldn't conflate these two issues; they are separate.
> This is what my comments were about. I think that you, by contrast,
> are talking about the encoding of the _input_ strings, in this case
> the 'documentation' argument to module_make_function and 'str'
> argument to module_make_string. My assumption was that these
> arguments will always have to be in UTF-8 encoding; if that assumption
> is true, then no decoding via code_convert_string_norecord is
> necessary, since make_multibyte_string will DTRT. We can (and
> probably should) document the fact that all non-ASCII strings must be
> UTF-8 encoded as a requirement of the emacs-module interface.
> Or rather, an extension to UTF-8 capable of encoding surrogate code points and
> numbers that are not code points, as described in
No, I meant strict UTF-8, not its Emacs extension.
That would be possible and provide a clean interface. However, Emacs strings are extended, so we'd need to specify how they interact with UTF-8 strings.
- If a module passes a char sequence that's not a valid UTF-8 string, but a valid Emacs multibyte string, what should happen? Error, undefined behavior, silently accepted?
- If copy_string_contents is passed an Emacs string that is not a valid Unicode string, what should happen? Error, or should the internal representation be silently leaked?
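To make the first question concrete, here is a minimal sketch of what a "strict UTF-8" check could look like if the interface chose to reject anything else. It assumes nothing about Emacs internals; the name strict_utf8_p is hypothetical and is not part of emacs-module.h or of Emacs itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: return true if the LEN
   bytes at S are strict UTF-8, i.e. no overlong forms, no surrogate
   code points (U+D800..U+DFFF), nothing above U+10FFFF.  */
static bool
strict_utf8_p (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = s[i];
      size_t need;        /* number of continuation bytes expected */
      unsigned long cp;   /* code point decoded so far */
      unsigned long min;  /* smallest code point valid for this length */

      if (c < 0x80)
        {
          i++;            /* plain ASCII */
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        { need = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0)
        { need = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0)
        { need = 3; cp = c & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation byte or 5+ byte lead */

      if (len - i <= need)
        return false;     /* sequence is truncated */

      for (size_t k = 1; k <= need; k++)
        {
          unsigned char cc = s[i + k];
          if ((cc & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (cc & 0x3F);
        }

      if (cp < min)
        return false;     /* overlong encoding */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;     /* surrogate code point */
      if (cp > 0x10FFFF)
        return false;     /* outside the Unicode range */

      i += need + 1;
    }
  return true;
}
```

With a check like this, the bytes ED A0 80 (the surrogate U+D800 written in UTF-8 style) are rejected, even though they are the kind of sequence the extended encoding mentioned above is meant to cover. Whether a rejection should become an error signal or something else is exactly the open question.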
> If it's stable, we can use make_multibyte_string; if not, we can
> only use make_unibyte_string.
If the argument strings are in strict UTF-8, then
make_multibyte_string will DTRT automagically, no matter what the
internal representation is. That is their contract.
OK, then we can use that, of course. The question of handling invalid UTF-8 strings is still open, though, as make_multibyte_string doesn't enforce valid UTF-8.
If it's the contract of make_multibyte_string that it will always accept UTF-8, then that should be added as a comment to that function. Currently I don't see it documented anywhere.