Re: Improvements to `(emacs)File Variables'

From: Stefan Monnier
Subject: Re: Improvements to `(emacs)File Variables'
Date: Mon, 15 Nov 2004 00:15:05 -0500
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/21.3.50 (gnu/linux)

> I'm not sure.  "Unibyte" as used in emacs seems (to me) to imply several
> things:  (1) of course, a single byte per character, (2) the concept of
> strings/buffers whose encoding is "unknown".

> If you were to consistently treat (2) as in fact meaning an explicit
> "binary" encoding, maybe it would be useful, but my impression is that
> at least historically, people/code have _not_ always done this, leading
> to lots and lots of confusion.  I suppose much of the reason is that
> people want the efficiency gain of (1), and either don't realize the
> problems caused by (2) or think they can kludge around it.

> As I've posted before, I think "unibyte" strings/buffers should be only
> an optimization, and should have an explicit (8-bit) encoding associated
> with them, so that any conversions to/from multibyte can automatically
> do the correct thing; one of these encodings could of course be "binary",
> which maybe would allow the historical usage of unibyte to be preserved.

I'd tend to disagree on the idea of associating an encoding with
unibyte buffers.  I think a large part of the problem is that people with
a unibyte background (i.e. latin-1 mostly) typically confuse the notion of
character and byte and mix things up hopelessly.

In Emacs-20, automatic conversion between unibyte and multibyte was provided
mostly as a way to work "correctly" even with confused code which didn't
understand that there are more than 256 characters in this world.

It made sense at the time to avoid alienating too many Emacs coders.
But to get things right, the first thing we need to do is to make it very
clear that there is no way to automatically convert between unibyte
and multibyte.  Such a conversion should only be doable via
(en|de)code-coding-foo functions, thus forcing anyone who wants to go down
that path to actually provide a coding system explicitly and thus to think
of what coding system should be used.
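
As a rough illustration of that discipline, here is a small sketch in Python 3
rather than Emacs Lisp.  It is only an analogy: Python's bytes stand in for
unibyte text and str for multibyte text, and the sample bytes and codec names
are assumptions chosen for the example, not anything from Emacs.

```python
# Python 3 analogy, not Emacs Lisp: bytes stand in for unibyte text,
# str for multibyte text.
raw = b"caf\xe9"  # four bytes; nothing records which encoding produced them

# The caller must name a coding system, and the choice changes the result.
as_latin1 = raw.decode("latin-1")   # 0xE9 read as U+00E9
as_cp1251 = raw.decode("cp1251")    # same byte read as a Cyrillic letter

# Mixing the two kinds of text without an explicit conversion is refused.
try:
    raw + "!"
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Because the conversion cannot happen silently, the code has to say which
coding system it means at the point where bytes become characters.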

After all, autoconversion can only work for 8-bit encodings, so any code which
uses autoconversion falls into one of two cases:
1 - the code somehow knows that all the possible encodings it might need to
    use there are 8-bit.  Most likely, it's the case where there's only ever
    one encoding used.
2 - the code *doesn't* know, but just assumes (probably without even being
    aware of it) that all encodings are 8-bit.  Thus it will break if used
    in China, Japan, ...
Situation 2 is a bug.  Situation 1 seems rather unusual.  My conclusion is
that autoconversion is harmful.
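
Situation 2 can be sketched as follows, again in Python 3 as an analogy and
not as Emacs code.  The helper naive_autoconvert is hypothetical; it models
code that treats every byte as one character, and EUC-JP is just one example
of an encoding that is not 8-bit.

```python
text = "日本"                  # two characters
euc = text.encode("euc_jp")    # EUC-JP uses two bytes for each of these kanji

# Hypothetical helper: what code assuming "one byte = one character" does.
def naive_autoconvert(data: bytes) -> str:
    return "".join(chr(b) for b in data)

garbled = naive_autoconvert(euc)   # four unrelated characters, not two kanji

# Only an explicit decode with the right coding system recovers the text.
recovered = euc.decode("euc_jp")
```

With a single-byte encoding the naive version happens to work, which is how
the assumption goes unnoticed until the code meets multibyte text.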

I've hacked my own local Emacs to "disallow" autoconversion
(i.e. autoconversion from unibyte to multibyte is allowed and generates
eight-bit-control and eight-bit-graphic chars; autoconversion from
multibyte to unibyte is allowed but only for ASCII, eight-bit-graphic, and
eight-bit-control chars; any other char causes an error).  It actually works
fairly well.  The main problems I encounter have to do with regexp matching
where the regexp is multibyte and the text is unibyte.
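
The restricted rules described above resemble, loosely, what Python 3's
"surrogateescape" error handler does, so a sketch of the idea can be written
with it.  This is an analogy under that assumption, not the actual Emacs
patch; to_multibyte and to_unibyte are hypothetical names for the example.

```python
def to_multibyte(data: bytes) -> str:
    # Non-ASCII bytes become placeholder "raw byte" code points,
    # never real characters.
    return data.decode("ascii", errors="surrogateescape")

def to_unibyte(text: str) -> bytes:
    # Only ASCII and the raw-byte placeholders may be converted back.
    return text.encode("ascii", errors="surrogateescape")

roundtrip = to_unibyte(to_multibyte(b"caf\xe9"))   # the bytes survive unchanged

# A genuine non-ASCII character has no implicit byte form, so this fails.
try:
    to_unibyte("caf\u00e9")
    real_char_ok = True
except UnicodeEncodeError:
    real_char_ok = False
```

Bytes of unknown encoding pass through losslessly, while any attempt to turn
a real non-ASCII character into bytes without naming a coding system raises
an error.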

