Re: Dynamic loading progress

From: Eli Zaretskii
Subject: Re: Dynamic loading progress
Date: Sun, 22 Nov 2015 20:04:28 +0200

> From: Philipp Stephani <address@hidden>
> Date: Sun, 22 Nov 2015 14:56:12 +0000
> Cc: address@hidden, address@hidden, address@hidden
> - The multibyte API should use an extension of UTF-8 to encode Emacs strings.
> The extension is the obvious one already in use in multiple places.

It is only used in one place: the internal representation of
characters in buffers and strings.  Emacs _never_ lets this internal
representation leak outside.  In practice, that means text which
Emacs has encoded in UTF-8 contains only valid UTF-8 sequences for
characters whose codepoints are below #x200000, or single bytes that
don't belong to any UTF-8 sequence.
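
As a rough illustration of that property (a sketch only, not Emacs's
actual encoder or decoder), the following Python splits a byte string
into shortest-form UTF-8-style sequences for codepoints below
0x200000 and stray single bytes.  The names FORMS, decode_one and
split_chars_and_raw_bytes are hypothetical, and the sketch makes no
attempt to treat surrogates specially.

```python
# Hypothetical helper names; nothing here is taken from the Emacs sources.
FORMS = [
    # (lead-byte mask, lead-byte pattern, total length, smallest codepoint)
    (0x80, 0x00, 1, 0x0),
    (0xE0, 0xC0, 2, 0x80),
    (0xF0, 0xE0, 3, 0x800),
    (0xF8, 0xF0, 4, 0x10000),
]


def decode_one(data, i):
    """Return (codepoint, length) for a sequence starting at i, or None."""
    lead = data[i]
    for mask, pattern, length, minimum in FORMS:
        if (lead & mask) == pattern:
            tail = data[i + 1:i + length]
            if len(tail) != length - 1:
                return None  # sequence is cut short
            if any((b & 0xC0) != 0x80 for b in tail):
                return None  # a trailing byte is not a continuation byte
            cp = lead & ~mask & 0xFF
            for b in tail:
                cp = (cp << 6) | (b & 0x3F)
            # Reject non-shortest forms such as C0 80.
            return (cp, length) if cp >= minimum else None
    return None  # continuation byte or F8..FF used as a lead byte


def split_chars_and_raw_bytes(data):
    """Classify data as ('char', codepoint) and ('raw', byte) items."""
    items = []
    i = 0
    while i < len(data):
        decoded = decode_one(data, i)
        if decoded is None:
            items.append(("raw", data[i]))
            i += 1
        else:
            cp, length = decoded
            items.append(("char", cp))
            i += length
    return items
```

Every byte that cannot start a well-formed sequence is reported on
its own, so nothing in the input is dropped or merged.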

You are suggesting to expose the internal representation to outside
application code, which predictably will cause that representation to
leak into Lisp.  That'd be a disaster.  We had something like that
back in the Emacs 20 era, and it took many years to plug those leaks.
We would be making a grave mistake to go back there.

What you suggest is also impossible without deep changes in how we
decode and encode text: that process maps codepoints above #x1FFFFF to
either codepoints below that mark or to raw bytes.  So it's impossible
to produce these high codes in UTF-8 compatible form while handling
UTF-8 text.  To say nothing about the simple fact that no library
function in any C library will ever be able to do anything useful with
such codepoints, because they are our own invention.
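
To make the library point concrete, here is a small sketch that uses
Python's built-in strict UTF-8 codec as a stand-in for "a library
function"; it is an analogy, not a C library call.  The helper name
strict_decoder_accepts and the two sample byte strings are
assumptions chosen for illustration.

```python
def strict_decoder_accepts(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# UTF-8-style byte patterns for values outside the Unicode range.
beyond_unicode = b"\xf4\x90\x80\x80"      # would encode 0x110000
five_byte_form = b"\xf8\x88\x80\x80\x80"  # would encode 0x200000

results = {
    "valid": strict_decoder_accepts("\u00e9".encode("utf-8")),
    "beyond_unicode": strict_decoder_accepts(beyond_unicode),
    "five_byte_form": strict_decoder_accepts(five_byte_form),
}
```

A conforming decoder rejects both out-of-range forms, so code written
against such a decoder cannot consume them.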

> - There should be a one-to-one mapping between Emacs multibyte strings and
> encoded module API strings.

UTF-8 encoded strings satisfy that requirement.

> Therefore non-shortest forms, illegal code unit sequences, and code
> unit sequences that would encode values outside the range of Emacs
> characters are illegal and raise a signal.

Once again, this was tried in the past and was found to be a bad idea.
Emacs provides features to test the result of converting invalid
sequences, for the purposes of detecting such problems, but it leaves
that to the application.

> Likewise, such sequences will never be returned from Emacs.

Emacs doesn't return invalid sequences if the original text didn't
include raw bytes.  If there were raw bytes in the original text,
Emacs has no choice but to return them unchanged, or else it would
violate a basic expectation of a text-processing program: that it
never changes the portions of text that were not affected by the
processing.
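
The same expectation can be sketched with Python's "surrogateescape"
error handler, which is only an analogy here: Emacs represents raw
bytes differently, and the function names roundtrip and has_raw_bytes
are hypothetical.  The point is that undecodable bytes survive a
decode/encode round trip unchanged, and that checking for them is
left to the caller.

```python
def roundtrip(data: bytes) -> bytes:
    """Decode and re-encode data, preserving undecodable bytes."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


def has_raw_bytes(text: str) -> bool:
    """Report whether decoding had to carry raw bytes through.

    surrogateescape maps each undecodable byte 0x80..0xFF to a lone
    surrogate in the range U+DC80..U+DCFF.
    """
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)


original = b"caf\xc3\xa9 \xff\xfe tail"
decoded = original.decode("utf-8", errors="surrogateescape")
unchanged = roundtrip(original) == original
```

An application that cares about invalid input can call a check like
has_raw_bytes on the decoded text; one that doesn't care still gets
its bytes back intact.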

> I think this is a relatively simple and unsurprising approach. It allows
> encoding the documented Emacs character space while still being fully
> compatible with UTF-8 and not resorting to undocumented Emacs internals.

So does the approach I suggested.  The advantage of my suggestion is
that it follows a long Emacs tradition about every aspect of encoding
and decoding text, and doesn't require any changes in the existing
