Re: Fwd: Re: Inadequate documentation of silly characters on screen.

From: Stefan Monnier
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 09:08:29 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux)

> The above sequence "works" in Emacs 22.3, in the sense that "ñ" gets
> displayed

There are many differences that cause it to work completely differently:

> - when I do M-: (aset nl 0 ?ñ), I get

>    "2289 (#o4361, #x8f1)" (Emacs 22.3)
>    "241 (#o361, #xf1)"    (Emacs 23.1)

?ñ = 2289 in Emacs-22
?ñ = 241  in Emacs-23

So in Emacs-22, there is no possible confusion for this char with
a byte.
So when you do the `aset', Emacs-22 converts the unibyte string nl to
multibyte, whereas Emacs-23 doesn't.  From then on, in Emacs-22 your
example is all multibyte, so there's no surprise.

Now if in Emacs-22 you do instead (aset nl 0 241), where 241 in Emacs-22
is not a valid char and can hence only be a byte, then aset leaves the
string as unibyte and we end up with the same nl as in Emacs-23.  But if
you then (insert nl), Emacs-22 will probably end up inserting a ñ in
your buffer, because Emacs-22 performs a decoding step using your
language environment when inserting a unibyte string into a multibyte
buffer.  This used to be helpful for code that didn't know enough about
Mule to set up coding systems properly, which is why it was done, but
nowadays it just hid bugs and encouraged sloppy coding, so we removed
it.
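
If you really do have bytes and want characters, the decoding now has to
be spelled out.  A rough sketch of what that could look like, where
`my-decode-latin1-byte' is a made-up helper and Latin-1 is only an
assumption about what the bytes mean:

```elisp
;; Sketch only: `my-decode-latin1-byte' is a hypothetical helper, and
;; Latin-1 is an assumed encoding; use whatever your bytes really are.
(defun my-decode-latin1-byte (byte)
  "Decode the single BYTE (0-255) explicitly as Latin-1.
Return (BYTES-MULTIBYTE-P TEXT-MULTIBYTE-P CHAR-CODE)."
  (let* ((bytes (unibyte-string byte))                   ; string of bytes
         (text  (decode-coding-string bytes 'latin-1)))  ; string of chars
    (list (multibyte-string-p bytes)
          (multibyte-string-p text)
          (aref text 0))))

(my-decode-latin1-byte 241)   ; => (nil t 241), i.e. the char ñ
```

The point is only that the conversion is visible in the code instead of
happening behind your back at insertion time.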

> fix it before the pretest?  How about interpreting "\n" and friends as
> multibyte or unibyte according to the prevailing flavour?

I'm not sure what that means.  But maybe "\n" should be multibyte, yes.

>> If you give us more context (i.e. more of the real code where the
>> problem show up), maybe we can tell you how to avoid it.

> OK.  I have my own routine to display regexps.  As a first step, I
> translate \n -> ñ, (and \t, \r, \f similarly).  This is how:

>     (defun translate-rnt (regexp)
>       "REGEXP is a string.  Translate any \t \n \r and \f characters
>     to weird non-ASCII printable characters: \t to Î (206, \xCE), \n
>     to ñ (241, \xF1), \r to ® (174, \xAE) and \f to £ (163, \xA3).
>     The original string is modified."
>       (let (ch pos)
>         (while (setq pos (string-match "[\t\n\r\f]" regexp))
>           (setq ch (aref regexp pos))
>           (aset regexp pos                        ; <===================
>                 (cond ((eq ch ?\t) ?Î)
>                       ((eq ch ?\n) ?ñ)
>                       ((eq ch ?\r) ?®)
>                       (t           ?£))))
>         regexp))

Each one of those `aset' (when performed according to your wishes) would
change the byte-size of the string, so it would internally require
copying the whole string each time: aset on (multibyte) strings is very
inefficient (compared to what most people expect, not necessarily
compared to other operations).  I'd recommend you use higher-level
operations since they'll work just as well and are less susceptible to
such problems:

  (replace-regexp-in-string "[\t\n\r\f]"
                            (lambda (s)
                              (or (cdr (assoc s '(("\t" . "Î")
                                                  ("\n" . "ñ")
                                                  ("\r" . "®"))))
                                  "£"))   ; anything else matched is \f
                            regexp)
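
Wrapped up as a function, that could look like the sketch below.  The
name `my-translate-rnt' is made up, the \f -> £ fallback is taken from
the `cond' in your code, and the trailing `t t' (FIXEDCASE and LITERAL)
is my addition so the replacement is inserted verbatim.  Unlike your
version it returns a new string rather than modifying its argument.

```elisp
;; Hypothetical non-destructive counterpart of `translate-rnt'.
;; The replacements are written as \uNNNN escapes so the code does not
;; depend on how this file is encoded.
(defun my-translate-rnt (regexp)
  "Return a copy of REGEXP with tab, newline, CR and formfeed made visible."
  (replace-regexp-in-string
   "[\t\n\r\f]"
   (lambda (s)
     (or (cdr (assoc s '(("\t" . "\u00CE")     ; Î
                         ("\n" . "\u00F1")     ; ñ
                         ("\r" . "\u00AE"))))  ; ®
         "\u00A3"))                            ; £, for \f
   regexp t t))

(my-translate-rnt "a\nb\tc")   ; => "añbÎc"
```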

> Why do we have both unibyte and multibyte?  Is there any reason
> not to remove unibyte altogether (though obviously not for 23.2).

Because bytes and chars are different, so we have strings of bytes and
strings of chars.  The problem with it is not their combined existence,
but the fact that they are not different enough.  Many people don't
understand the difference between chars and bytes, but even more people
can't figure out which Elisp operation returns a unibyte string and
which a multibyte string, and that for a "good" reason: it's very
difficult to predict.

Emacs-23 tries to help in this in the following ways:
- `string' always builds a multibyte string now, so if you want
  a unibyte string, you need to use the new `unibyte-string' function.
- we don't automatically perform encoding/decoding conversions between
  the two forms, so we hide the difference a bit less.
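
To make those two points concrete, here is a small sketch
(`my-string-flavours' is a made-up name, and it is only meant for
codes in the 128-255 range):

```elisp
;; Hypothetical helper illustrating `string' vs `unibyte-string'.
(defun my-string-flavours (code)
  "Build strings from CODE (128-255) both ways and report on them.
Return (CHARS-MULTIBYTE-P BYTES-MULTIBYTE-P SAME-CHAR-AFTER-CONVERSION-P)."
  (let ((chars (string code))            ; string of characters
        (bytes (unibyte-string code)))   ; string of bytes
    (list (multibyte-string-p chars)
          (multibyte-string-p bytes)
          ;; No decoding happens here: the byte becomes a raw
          ;; "eight-bit" character, not the character CODE.
          (= code (aref (string-to-multibyte bytes) 0)))))

(my-string-flavours 241)   ; => (t nil nil)
```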

We should probably move towards making all string immediates multibyte
and add a new syntax for unibyte immediates.

> What was the change between 22.3 and 23.1 that broke my code?

Mostly: the change to Unicode internal representation which made 241
(and other byte values) ambiguous since it can also be interpreted now
as a character value.
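
The ambiguity can be seen directly in Emacs-23 with something like the
following (a sketch; `my-241-ambiguity' is a made-up name):

```elisp
;; Hypothetical demo: the same integer names a char and a byte.
(defun my-241-ambiguity ()
  "Show that 241 is both a valid character and a valid byte.
Return (CHAR-ELT BYTE-ELT CHAR-STRING-BYTES BYTE-STRING-BYTES EQUAL-P)."
  (let ((as-char (string 241))            ; multibyte, the char ñ
        (as-byte (unibyte-string 241)))   ; unibyte, the byte #xF1
    (list (aref as-char 0)                ; the element is 241 ...
          (aref as-byte 0)                ; ... in both strings
          (string-bytes as-char)          ; 2 bytes internally
          (string-bytes as-byte)          ; 1 byte
          (equal as-char as-byte))))      ; yet the strings differ

(my-241-ambiguity)   ; => (241 241 2 1 nil)
```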

> Would it, perhaps, be a good idea to reconsider that change?

I think you'll understand that reverting to the emacs-mule
(iso-2022-based) internal representation is not really on the table ;-)

