emacs-devel

Re: String encoding in json.c


From: Philipp Stephani
Subject: Re: String encoding in json.c
Date: Sat, 23 Dec 2017 15:31:06 +0000



Eli Zaretskii <address@hidden> wrote on Sat, 23 Dec 2017 at 15:43:
> From: Philipp Stephani <address@hidden>
> Date: Sat, 23 Dec 2017 14:26:09 +0000
>
> I've benchmarked serialization and parsing of JSON with and without explicit encoding. I've found that leaving
> out the coding makes both operations significantly faster – from a speedup of a factor of 1.11 ± 0.06 for
> parsing canada.json to 1.57 ± 0.08 for serializing twitter.json. Other speedups are in between, but the
> speedup is always significant (to at least one standard deviation). All unit tests pass when leaving out the
> coding steps – which isn't surprising given that currently the coding operations are expensive no-ops.

The coding operations are "expensive no-ops" except when they aren't,
and that is exactly when we need their "expensive" parts.

In which case are they not no-ops? I've spot-checked some of the implementation details of coding.c, and I haven't found obvious cases where they are not no-ops. Emacs appears to use the obvious extension of UTF-8 for integers that are not Unicode scalar values, and that's even documented in character.h and the Elisp reference manual. Using utf-8-unix as encoding seems to keep the encoding intact.
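To make "the obvious extension of UTF-8" concrete, here is a minimal sketch in C. The helper name encode_extended_utf8 is hypothetical and is not taken from the Emacs sources; it simply applies the ordinary UTF-8 bit layout without rejecting surrogates or values above U+10FFFF, and it stops at 4-byte sequences. It does not attempt to reproduce the exact Emacs layout (for instance, how raw eight-bit bytes are stored); character.h remains the reference for that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the Emacs sources.  Encode the
   integer C using the ordinary UTF-8 bit layout, but without the
   Unicode restrictions: surrogates (0xD800..0xDFFF) and values above
   0x10FFFF are accepted.  Writes at most 4 bytes to OUT and returns
   the number of bytes written, or 0 if C does not fit in 4 bytes.  */
static size_t
encode_extended_utf8 (uint32_t c, unsigned char *out)
{
  assert (out != NULL);
  if (c <= 0x7F)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c <= 0x7FF)
    {
      out[0] = (unsigned char) (0xC0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c <= 0xFFFF)
    {
      /* Strict UTF-8 would reject 0xD800..0xDFFF here; this sketch
         deliberately does not.  */
      out[0] = (unsigned char) (0xE0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  if (c <= 0x1FFFFF)
    {
      /* Strict UTF-8 stops at 0x10FFFF; the 4-byte layout itself has
         room for 21 bits.  */
      out[0] = (unsigned char) (0xF0 | (c >> 18));
      out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (c & 0x3F));
      return 4;
    }
  return 0;
}
```

With this layout a lone surrogate such as 0xD800 comes out as the three bytes ED A0 80, and 0x110000 as F4 90 80 80. A strict UTF-8 decoder would reject both, so it is only for integers like these that an encoding step could differ from copying the bytes unchanged.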
 

> Therefore I'd suggest that we document the internal string encoding in lisp.h or character.h and remove the explicit
> coding in json.c and emacs-module.c. It's very unlikely that the internal string encoding will change frequently,
> and if so, the unit tests should catch potential issues caused by that.

As I've already said, I don't think this particular case should be an
exception wrt how Emacs behaves with external strings everywhere
else.  We suffer similar slow-downs in those other places as well, and
IMO this is a small penalty to pay for making sure our objects are
valid and won't crash Emacs.

I've spot-checked some other code where we interface with external libraries, namely dbusbind.c and gnutls.c. In neither file have I found explicit coding operations (except for filenames, where the situation is different); these files always use SDATA directly. dbusbind.c even has the comment

  /* We need to send a valid UTF-8 string.  We could encode `object'
     but by not encoding it, we guarantee it's valid utf-8, even if
     it contains eight-bit-bytes.  Of course, you can still send
     manually-crafted junk by passing a unibyte string.  */

So not only do we not encode strings explicitly, we even *prefer* not encoding them, and we do rely on the internal string encoding being an extension of UTF-8. It's the *current* json.c (and emacs-module.c) that's inconsistent with the rest of the codebase.
