emacs-devel

From: Philipp Stephani
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 17:27:22 +0000



Eli Zaretskii <address@hidden> wrote on Sat, 23 Dec 2017 at 16:53:
>> From: Philipp Stephani <address@hidden>
>> Date: Sat, 23 Dec 2017 15:31:06 +0000
>> Cc: address@hidden
>>
>>  The coding operations are "expensive no-ops" except when they aren't,
>>  and that is exactly when we need their "expensive" parts.
>>
>> In which case are they not no-ops?

> When the input is not a valid UTF-8 sequence.  When that happens, we
> produce a special representation of such raw bytes instead of
> signaling EILSEQ and refusing to decode the input.  Encoding (if and
> when it is done) then performs the opposite conversion, producing the
> same single raw byte in the output stream.  This allows Emacs to
> manipulate text that includes invalid sequences without crashing,
> because all the low-level primitives that walk buffer text and strings
> by characters assume the internal representation of each character is
> valid.

OK, thanks for the refresher. I was aware of the single byte representation, but forgot how exactly it's handled during coding.
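To make this concrete (mostly for my own benefit), here is the round trip at the Lisp level -- a quick sketch, where "\xc3(" just stands for some invalid UTF-8 sequence:

  ;; Decoding keeps the stray 0xC3 as a raw-byte character instead of
  ;; signaling an error:
  (decode-coding-string "\xc3(" 'utf-8-unix)
  ;;   => a 2-character multibyte string: the raw byte #xC3, then ?\(

  ;; Encoding performs the opposite conversion and emits the same single
  ;; raw byte again:
  (encode-coding-string
   (decode-coding-string "\xc3(" 'utf-8-unix) 'utf-8-unix)
  ;;   => a unibyte string with the original two bytes, #xC3 #x28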
 

>> Using utf-8-unix as encoding seems to keep the encoding intact.

> First, you forget about decoding.

OK, let's treat encoding and decoding separately.

- We encode Lisp strings when passing them to Jansson. Jansson only accepts UTF-8 strings and fails (with proper error reporting, not a crash) when it encounters a non-UTF-8 string. I think encoding can only make a difference here for strings that contain raw bytes which happen to form valid UTF-8 code unit sequences, such as "Ä\xC3\x84". This string is encoded as "\xC3\x84\xC3\x84" using utf-8-unix. (Note how this is a case where encoding and decoding are not inverses of each other.) Without encoding, the string contents will be \xC3\x84 plus two invalid two-byte sequences (the internal representation of the raw bytes). I think it's not obvious at all which interpretation is correct; after all, "Ä\xC3\x84" is not equal to "ÄÄ", but the two strings now result in the same JSON representation (see the sketch after these two points). This is at least surprising, and I'd argue that the other behavior (raising an error) would be more correct and less surprising.

- We decode UTF-8 strings after receiving them from Jansson. Jansson guarantees to only ever emit well-formed UTF-8. Given that for well-formed UTF-8 strings, the UTF-8 representation and the Emacs representation are one and the same, we don't need decoding.
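Here is a quick Lisp-level sketch of both points (going by my reading of the string literal rules, where \xC3 and \x84 inside a multibyte literal denote raw bytes):

  ;; Encoding: a string containing raw bytes and the "real" string "ÄÄ"
  ;; are distinct Lisp strings, yet become indistinguishable once encoded:
  (equal "Ä\xC3\x84" "ÄÄ")                               ; => nil
  (equal (encode-coding-string "Ä\xC3\x84" 'utf-8-unix)
         (encode-coding-string "ÄÄ" 'utf-8-unix))        ; => t

  ;; Decoding: for well-formed UTF-8 input, reinterpreting the bytes as
  ;; the internal representation should give the same string as decoding
  ;; them, i.e. an explicit decoding step shouldn't change anything:
  (equal (string-as-multibyte "\303\204")        ; the UTF-8 bytes of "Ä"
         (decode-coding-string "\303\204" 'utf-8-unix))  ; => t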

 
> And second, encoding keeps the
> encoding intact precisely because it is not a no-op: raw bytes are
> held in buffer and string text as special multibyte sequences, not as
> single bytes, so just copying them to output instead of encoding will
> produce non-UTF-8 multibyte sequences.

That's the correct behavior, I think. JSON values must be valid Unicode strings, and raw bytes are not.
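(That blow-up is visible from Lisp, by the way: if I'm reading character.h correctly, a raw byte occupies two bytes in the internal representation, and those two bytes are not valid UTF-8:

  (string-bytes (string-to-multibyte "\xff"))  ; => 2, for a 1-character string

so copying SDATA directly really would hand Jansson a non-UTF-8 sequence.)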
 

>> I've spot-checked some other code where we interface with external libraries, namely dbusbind.c and
>> gnutls.c. In no case did I find explicit coding operations (except for filenames, where the situation is
>> different); these files always use SDATA directly. dbusbind.c even has the comment
>>
>>   /* We need to send a valid UTF-8 string.  We could encode `object'
>>      but by not encoding it, we guarantee it's valid utf-8, even if
>>      it contains eight-bit-bytes.  Of course, you can still send
>>      manually-crafted junk by passing a unibyte string.  */

> If gnutls.c and dbusbind.c don't encode and decode text that comes
> from and goes to outside, then they are buggy.

Not necessarily. As mentioned before, the internal encoding of multibyte strings is even documented in the Lisp reference, and the comment above indicates that it's OK to rely on it, at least within the Emacs codebase.
BTW, that comment was added by Stefan in commit e454a4a330cc6524cf0d2604b4fafc32d5bda795, where he removed an explicit encoding step.
 
> (At least for
> gnutls.c, I think you are mistaken, because the encoding/decoding is
> in process.c, see, e.g., read_process_output.)

Some parts are definitely encoded, but for example, there is c_hostname in Fgnutls_boot, which doesn't encode the user-supplied string.
 

>> It's the *current* json.c (and emacs-module.c) that's inconsistent
>> with the rest of the codebase.

> Well, I disagree with that conclusion.  Just look at all the calls to
> decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
> and you will see where we do that.

We obviously do *some* encoding and decoding. But when interacting with third-party libraries that themselves use UTF-8, we seem to leave it out fairly frequently.
