emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Emacs Lisp's future


From: Eli Zaretskii
Subject: Re: Emacs Lisp's future
Date: Mon, 06 Oct 2014 18:08:29 +0300

> From: Mark H Weaver <address@hidden>
> Cc: address@hidden,  address@hidden,  address@hidden,  address@hidden,  
> address@hidden,  address@hidden,  address@hidden
> Date: Mon, 06 Oct 2014 02:21:41 -0400
> 
> A related problem has to do with the fact that naively implemented UTF-8
> allows code points to be represented with more bytes than are actually
> needed, essentially by padding the code point with leading zeroes and
> then encoding with UTF-8 as if the high bits were non-zero.  For
> example, the ASCII quote (") can be represented as the single byte 0x22,
> the two byte sequence 0xC0 0xA2, etc.
> 
> UTF-8 decoders are supposed to detect and reject these "overlong"
> encodings, but it is likely that many programs fail to do this.  Such
> programs are usually vulnerable to these overlong encodings when trying
> to detect special characters (e.g. for quoting/escaping) or when
> validating inputs.
> 
> To cope with this, the Unicode standards require that UTF-8 codecs
> reject overlong encodings and other invalid byte sequences.  This is in
> direct conflict with the idea of "raw byte" code points, whose purpose
> is to be tolerant of arbitrary byte sequences and to propagate them
> unchanged.

The obvious solution is to encode the raw bytes internally in a UTF-8
compatible way.  Which is what Emacs does in its buffers and strings,
as I'm sure you know.  Can't Guile do something similar?

> FWIW, I agree that the Emacs behavior is desirable when editing a file
> that may contain coding errors, but in most other cases (e.g. when
> communicating with processes or network sockets) I think that it's more
> appropriate to refuse to accept, produce, or propagate invalid UTF-8
> such as overlong encodings.

Emacs indeed rejects them, but that doesn't mean it disallows raw
bytes as part of otherwise valid UTF-8 content.  It's a fact of life
that such stray bytes sometimes happen, and users would be generally
unhappy if Emacs would reject a file because it had such bytes.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]