emacs-devel
Re: String encoding in json.c


From: Eli Zaretskii
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 17:53:38 +0200

> From: Philipp Stephani <address@hidden>
> Date: Sat, 23 Dec 2017 15:31:06 +0000
> Cc: address@hidden
> 
>  The coding operations are "expensive no-ops" except when they aren't,
>  and that is exactly when we need their "expensive" parts.
> 
> In which case are they not no-ops?

When the input is not a valid UTF-8 sequence.  When that happens, we
produce a special representation of such raw bytes instead of
signaling EILSEQ and refusing to decode the input.  Encoding (if and
when it is done) then performs the opposite conversion, producing the
same single raw byte in the output stream.  This allows Emacs to
manipulate text that includes invalid sequences without crashing,
because all the low-level primitives that walk buffer text and strings
by characters assume the internal representation of each character is
valid.
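
To make the decoding step concrete, here is a rough, self-contained
sketch.  It is a simplified model, not the actual Emacs decoder: the
names utf8_seq_len and model_decode_utf8 are hypothetical, the
validity check is plain strict UTF-8, and the two-byte form used for a
raw byte (0xC0 or 0xC1 followed by one trailing byte) is my reading of
BYTE8_STRING in src/character.h, so treat that detail as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the length (1..4) of the well-formed UTF-8 sequence that
   starts at S, or 0 if S does not start one.  N is the number of
   bytes available and must be at least 1.  */
static size_t
utf8_seq_len (const uint8_t *s, size_t n)
{
  uint8_t b = s[0];
  size_t len, i;

  if (b < 0x80)
    return 1;
  else if (b >= 0xC2 && b <= 0xDF)
    len = 2;
  else if (b >= 0xE0 && b <= 0xEF)
    len = 3;
  else if (b >= 0xF0 && b <= 0xF4)
    len = 4;
  else
    return 0;			/* stray trailing byte, 0xC0, 0xC1, 0xF5.. */

  if (n < len)
    return 0;			/* truncated sequence */
  for (i = 1; i < len; i++)
    if ((s[i] & 0xC0) != 0x80)
      return 0;
  /* Reject overlong forms, surrogates, and code points > U+10FFFF.  */
  if ((b == 0xE0 && s[1] < 0xA0) || (b == 0xED && s[1] > 0x9F)
      || (b == 0xF0 && s[1] < 0x90) || (b == 0xF4 && s[1] > 0x8F))
    return 0;
  return len;
}

/* Model of decoding: copy valid sequences unchanged, and turn every
   other byte into a two-byte "raw byte" sequence instead of failing.
   OUT must have room for 2 * N bytes.  Returns the bytes written.  */
size_t
model_decode_utf8 (const uint8_t *in, size_t n, uint8_t *out)
{
  size_t i = 0, o = 0;

  while (i < n)
    {
      size_t len = utf8_seq_len (in + i, n - i);
      if (len > 0)
	while (len-- > 0)
	  out[o++] = in[i++];
      else
	{
	  uint8_t b = in[i++];	/* always >= 0x80 here */
	  out[o++] = 0xC0 | ((b >> 6) & 1);
	  out[o++] = 0x80 | (b & 0x3F);
	}
    }
  return o;
}
```

In this model the input bytes 61 FF 62 become 61 C1 BF 62: the invalid
byte is preserved, but as a sequence that character-walking code can
step over like any other two-byte character.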

> Using utf-8-unix as encoding seems to keep the encoding intact.

First, you forget about decoding.  And second, encoding keeps the
encoding intact precisely because it is not a no-op: raw bytes are
held in buffer and string text as special multibyte sequences, not as
single bytes, so just copying them to output instead of encoding will
produce non-UTF-8 multibyte sequences.
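
A minimal sketch of why encoding cannot be replaced by a plain copy.
Again this is a simplified model rather than the real encoder:
model_encode_utf8 is a hypothetical name, and the two-byte raw-byte
form (lead byte 0xC0 or 0xC1) is assumed from my reading of
src/character.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of encoding: collapse each two-byte raw-byte sequence
   (0xC0 or 0xC1 followed by a trailing byte) back into the single
   byte it stands for, and copy everything else unchanged.
   OUT must have room for N bytes.  Returns the bytes written.  */
size_t
model_encode_utf8 (const uint8_t *in, size_t n, uint8_t *out)
{
  size_t i = 0, o = 0;

  while (i < n)
    {
      uint8_t b = in[i];
      if ((b == 0xC0 || b == 0xC1)
	  && i + 1 < n && (in[i + 1] & 0xC0) == 0x80)
	{
	  /* Low bit of the lead byte is bit 6 of the raw byte; the
	     trailing byte carries the low 6 bits.  */
	  out[o++] = 0x80 | ((b & 1) << 6) | (in[i + 1] & 0x3F);
	  i += 2;
	}
      else
	out[o++] = in[i++];
    }
  return o;
}
```

Given the internal bytes 61 C1 BF 62, this produces 61 FF 62, the
original raw byte.  A memcpy of the same text would instead emit
C1 BF, and strict UTF-8 forbids the lead bytes 0xC0 and 0xC1
altogether, so the receiver would get neither the original byte nor
valid UTF-8.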

> I've spot-checked some other code where we interface with external
> libraries, namely dbusbind.c and gnutls.c. In no case have I found
> explicit coding operations (except for filenames, where the situation
> is different); these files always use SDATA directly. dbusbind.c even
> has the comment
> 
>   /* We need to send a valid UTF-8 string.  We could encode `object'
>      but by not encoding it, we guarantee it's valid utf-8, even if
>      it contains eight-bit-bytes.  Of course, you can still send
>      manually-crafted junk by passing a unibyte string.  */

If gnutls.c and dbusbind.c don't encode and decode text that comes
from and goes to outside, then they are buggy.  (At least for
gnutls.c, I think you are mistaken, because the encoding/decoding is
in process.c, see, e.g., read_process_output.)

> It's the *current* json.c (and emacs-module.c) that's inconsistent
> with the rest of the codebase.

Well, I disagree with that conclusion.  Just look at all the calls to
decode_coding_*, encode_coding_*, DECODE_SYSTEM, ENCODE_SYSTEM, etc.,
and you will see where we do that.
