Re: Fwd: Re: Inadequate documentation of silly characters on screen.

From: Alan Mackenzie
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 18:08:48 +0000
User-agent: Mutt/1.5.9i

Hi, David!

On Thu, Nov 19, 2009 at 05:55:10PM +0100, David Kastrup wrote:
> Alan Mackenzie <address@hidden> writes:

> > On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote:
> >> > The actual character in the string is ñ (#x3f).

> >> No: the string does not contain any characters, only bytes, because
> >> it's a unibyte string.

> > I'm thinking from the lisp viewpoint.  The string is a data
> > structure which contains characters.  I really don't want to have to
> > think about the difference between "chars" and "bytes" when I'm
> > hacking lisp.  If I do, then the abstraction "string" is broken.

> >> So it contains the byte 241, not the character ñ.

> > That is then a bug.  I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)".

> Huh?  ?ñ is the Emacs code point of ñ.  Which is pretty much identical
> to the Unicode code point in Emacs 23.

No, you (all of you) are missing the point.  That point is that if an
Emacs Lisp hacker writes "?ñ", it should work, regardless of
what "codepoint" it has, what "bytes" represent it, whether those
"bytes" are coded with a different codepoint, or what have you.  All of
that stuff is uninteresting.  If it gets interesting, like now, it is
because it is buggy.

> >> The byte 241 can be inserted in multibyte strings and buffers
> >> because it is also a char of code 4194289 (which gets displayed as
> >> \361).

OK.  Surely displaying it as "\361" is a bug?  Should it not display as
"\17777761"?  If it did, it would have saved half of my ranting.
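For the record, the two spellings are related by plain arithmetic.  A
minimal sketch, assuming (my reading of the numbers quoted above, not
something stated in this thread) that Emacs 23 stores a raw byte N in
the range 128..255 as character code #x3FFF00 + N:

```elisp
;; Sketch only.  The #x3FFF00 offset is an assumption about how Emacs 23
;; maps raw bytes 128..255 onto character codes.
(format "%o" 241)              ; the byte in octal      => "361"
(format "%o" 4194289)          ; the char code in octal => "17777761"
(= (+ #x3FFF00 241) 4194289)   ; byte plus offset gives the char code => t
```

So "\361" is the octal of the byte, and "\17777761" would be the octal
of the character code that stands in for it.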

> > Hang on a mo'!  How can the byte 241 "be" a char of code 4194289?
> > This is some strange usage of the word "be" that I wasn't previously
> > aware of.  ;-)

> Emacs encodes most of its things in utf-8.  A Unicode code point is an
> integer.  You can encode it in different encodings, resulting in
> different byte streams.  Inside of a byte stream encoded in utf-8, the
> isolated byte 241 does not correspond to a Unicode character.  It is not
> valid utf-8.  When Emacs reads a file supposedly in utf-8, it wants to
> represent _all_ possible byte streams in order to be able to save
> unchanged data unmolested.

That's a good explanation - it's sort of like &lt; in html.  Thanks.
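To make the encoding point concrete, here is a small sketch of what
utf-8 does with ñ.  It assumes Emacs 23, where (string 241) yields a
multibyte string holding the character ñ; string-to-list is used only
to show the bytes as integers:

```elisp
;; Sketch only: encode the character ñ (code point 241) as utf-8 and
;; list the bytes of the resulting unibyte string.
(string-to-list (encode-coding-string (string 241) 'utf-8))
;; => (195 177), i.e. #xc3 #xb1: two bytes, neither of which is 241.
```

Encoding ñ as utf-8 therefore never produces a lone byte 241.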

> So it encodes the entity "illegal isolated byte 241 in an utf-8
> document" with the character code 4194289 which has a representation in
> Emacs' internal variant of utf-8, but is outside of the range of
> Unicode.

So, how did the character "ñ" get turned into the illegal byte #xf1?  Is
that the bug?

> > At this point, would you please just agree with me that when I do

> >    (setq nl "\n")
> >    (aset nl 0 ?ñ)
> >    (insert nl)

> > , what should appear on the screen should be "ñ", NOT "\361"?  Thanks!

> You assume that ?ñ is a character.

I do indeed.  It is self-evident.

Now, would you too please just agree that when I execute the three forms
above, "ñ" should appear?

The identical argument applies to "ä".  They are characters used in
writing weird European languages like Spanish and German.  Emacs should
not have difficulty with them.  It is a standard Emacs idiom that ?x (or
?\x) is the integer representing the character x.  Indeed (unlike in
XEmacs), characters ARE integers.  Why does this not work for, e.g.,

> But in Emacs, it is an integer, a Unicode code point in Emacs 23.

That sounds like the sort of argument one might read on
gnu-misc-discuss.  ;-)  Sorry.  Are you saying that Emacs is converting
"?ñ" and "?ä" into the wrong integers? 

> As long as there is something like a unibyte string, there is no way
> to distinguish the character 241 and the byte 241 except when Emacs is
> told explicitly.

What is the correct Emacs internal representation for "ñ" and "ä"?  They
surely cannot share internal representations with other

> Because Emacs has no separate "character" data type.

For which I am thankful.

> -- 
> David Kastrup

Alan Mackenzie (Nuremberg, Germany).
