bug-gnu-emacs

bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs


From: Vasilij Schneidermann
Subject: bug#27270: display-raw-bytes-as-hex generates ambiguous output for Emacs strings
Date: Sun, 24 Apr 2022 11:56:04 +0200

> > I tend to think that introducing a new syntax just to fix it
> > isn't worth it.
>
> That's fine, so let's fix the problem as originally suggested. That is,
> display the string returned by (format "%c%c" #x9e #x66) as "\x9e\x66"
> (equivalent to (concat "\x9e" "\x66") which is correct) instead of as
> "\x9ef" (equivalent to "\N{BENGALI DIGIT NINE}" which is wrong).
>
> This fixes the problem and doesn't introduce new syntax.

Wait, hold up. Under which conditions exactly does the bug happen? If I
use GUI Emacs, thanks to font-lock it's pretty obvious that the output
is three bytes, the first one displayed using the hex escape syntax and
the remaining two bytes using hex letters.  If I copy-paste those into
another GUI Emacs, it's still the same three bytes. I don't know about
terminal Emacs, but trying to work around terminals being bad doesn't
seem worth the extra effort.

Besides, suppose it is worth it, what exactly should the logic be here?
Detect if there's a preceding hex escaped byte and if yes, display
adjacent bytes that are formatted using hex characters using escaping,
too? That seems too involved for something run in redisplay.
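To make the proposed logic concrete, here is a minimal sketch in Python rather than in Emacs's C display code. It is not how Emacs prints or displays strings; the function name `render_with_hex_escapes` and the choice to treat every byte at or above 0x80 as a "raw byte" are assumptions made for illustration, and quoting of backslashes and double quotes is left out.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def render_with_hex_escapes(data):
    # Hypothetical helper: turn a bytes object into printable text.
    # Bytes >= 0x80 are written as a two-digit \x escape. An ASCII hex
    # digit that directly follows an escape is escaped as well, so the
    # output cannot be read back as one longer escape.
    out = []
    prev_escaped = False
    for byte in data:
        char = chr(byte)
        if byte >= 0x80 or (prev_escaped and char in HEX_DIGITS):
            out.append("\\x%02x" % byte)
            prev_escaped = True
        else:
            out.append(char)
            prev_escaped = False
    return "".join(out)

print(render_with_hex_escapes(bytes([0x9E, 0x66])))  # \x9e\x66
print(render_with_hex_escapes(bytes([0x9E, 0x67])))  # \x9eg
```

The only state carried along is a single flag per character, so the rule itself is cheap; whether it fits into redisplay is a separate question this sketch does not answer.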

The other proposed alternative, tightening up the read syntax, is
backwards-incompatible but seems saner to me overall. Emacs Lisp is the
odd one out here anyway. Only C and C++ also allow hex escapes longer
than two digits, and both fail with a compilation error for me.

    len("\x1234") # Python, Go: 3

    "\x1234".length # Ruby, JavaScript: 3

    length("\x1234") # Perl: 3

    (string-length "\x1234") ; Guile, Racket, CHICKEN: 3

    ;; Common Lisp absent because it lacks a lot of string escapes and
    ;; using FORMAT neatly sidesteps these issues

    ;; Clojure only has octal/unicode string escapes
    (count (seq "\u12345678")) ; Clojure: 5

    (length "\x1234") ; Emacs Lisp: 1

    strlen("\x1234") /* C: compilation error */

    std::string("\x1234").length() // C++: compilation error

    "\x1234".len() // Rust: 3
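The difference between the two reading rules in the comparison above can be sketched with a small Python function. This is a simplified model, not any language's actual reader: the name `decode_hex_escapes` and its `max_digits` parameter are hypothetical, and escaped backslashes and other escape kinds are ignored.

```python
import re

def decode_hex_escapes(text, max_digits=2):
    # Replace each \x escape in TEXT with the character it denotes.
    # max_digits=2 models the fixed-width rule most languages use;
    # max_digits=None models a greedy rule that consumes every hex
    # digit following \x.
    quantifier = "+" if max_digits is None else "{1,%d}" % max_digits
    pattern = re.compile(r"\\x([0-9a-fA-F]" + quantifier + ")")
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)

sample = r"\x1234"  # raw string: backslash, x, 1, 2, 3, 4
print(len(decode_hex_escapes(sample, max_digits=2)))     # 3
print(len(decode_hex_escapes(sample, max_digits=None)))  # 1
```

Under the greedy rule the text `\x9ef` becomes the single character U+09EF, while the two-digit rule yields the character 0x9E followed by the letter f.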

Before deciding on such a change, someone should figure out whether
anything could actually break because of it. That is, code with long hex
escapes in strings, be it manually authored (unlikely, people either use
raw bytes in strings or Unicode escapes) or automatically generated
(cannot comment on that, maybe the byte-code compiler emits such code?).
If nothing breaks, then it would be an obvious candidate for the next
major release of Emacs.
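One rough way to look for affected code is to scan source text for `\x` followed by three or more hex digits. The sketch below is only a starting point under stated assumptions: the name `find_long_hex_escapes` is hypothetical, and the scan does not parse Lisp, so it also reports matches in comments and in character literals such as `?\x1234`, which would need to be filtered by hand.

```python
import re

# \x followed by three or more hex digits. The leading part skips
# pairs of backslashes so that an escaped backslash followed by a
# literal "x1234" is not reported.
LONG_HEX_ESCAPE = re.compile(r"(?<!\\)(?:\\\\)*(\\x[0-9a-fA-F]{3,})")

def find_long_hex_escapes(source):
    # Return (line number, matched escape) pairs for every candidate.
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in LONG_HEX_ESCAPE.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits

example = "\n".join([
    r'(setq a "\x1234")',     # long escape: reported
    r'(setq b "\x9e\x66")',   # two-digit escapes: ignored
    r'(setq c "\\x1234")',    # escaped backslash: ignored
])
for lineno, escape in find_long_hex_escapes(example):
    print(lineno, escape)
```

Running this over a source tree would give a list of candidates to review, not a definitive answer about what breaks.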

On Sun, Apr 24, 2022 at 9:10 AM Paul Eggert <eggert@cs.ucla.edu> wrote:
>
> On 4/23/22 07:00, Lars Ingebrigtsen wrote:
> > we've had this format for half a decade now, and this doesn't
> > really seem to be a problem in practice
>
> Not surprising, since most people don't set display-raw-bytes-as-hex.
> But that doesn't mean it's not a problem. Quoting bugs can be issues
> even if they're unlikely to occur at random. (Think SQL injection. :-)
>
>
> > I tend to think that introducing a new syntax just to fix it
> > isn't worth it.
>
> That's fine, so let's fix the problem as originally suggested. That is,
> display the string returned by (format "%c%c" #x9e #x66) as "\x9e\x66"
> (equivalent to (concat "\x9e" "\x66") which is correct) instead of as
> "\x9ef" (equivalent to "\N{BENGALI DIGIT NINE}" which is wrong).
>
> This fixes the problem and doesn't introduce new syntax.




