bug-gnu-utils

Emacs, XEmacs, X11(?), "man"(?) i18n/utf-8 brokenness


From: Olaf Klischat
Subject: Emacs, XEmacs, X11(?), "man"(?) i18n/utf-8 brokenness
Date: Tue, 24 May 2005 21:19:39 +0200
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

[F'up to gnu.utils.bug]

First, this one:

http://user.cs.tu-berlin.de/~klischat/emacs-i18n-broken-by-design.png

I.e. several instances of the German umlaut "ü" in a buffer, some of
which are found by isearch, while others aren't. The ones that are
found were entered directly, the ones that aren't were copied in from
a Gnus article buffer displaying an ISO-8859-15 encoded news
posting. The Emacs instance runs under a UTF-8 locale
($LANG="en_US.UTF-8"). Looks like a design error to me -- it should
store buffer contents internally as a sequence of Unicode codepoints,
not as sequences of bytes + encoding (which is what I presume it
does atm).
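
To illustrate the distinction I mean, here is a minimal Python sketch.
It only shows that the same letter has different byte representations
in ISO-8859-15 and UTF-8 but a single Unicode codepoint; it makes no
claim about how Emacs actually stores buffer text, which remains my
presumption.

```python
# Assumption: this only illustrates bytes vs. codepoints; it does not
# model Emacs' internal buffer representation.
latin9_bytes = b"\xfc"      # "ü" encoded as ISO-8859-15
utf8_bytes = b"\xc3\xbc"    # "ü" encoded as UTF-8

# Compared as raw bytes, the two encodings of the same letter differ.
bytes_match = latin9_bytes == utf8_bytes

# Decoded to Unicode codepoints, both are U+00FC and compare equal.
from_latin9 = latin9_bytes.decode("iso-8859-15")
from_utf8 = utf8_bytes.decode("utf-8")
codepoints_match = from_latin9 == from_utf8

print(bytes_match, codepoints_match, f"U+{ord(from_utf8):04X}")
```

A search that compares codepoints would find both variants; one that
compares bytes tagged with their source encoding would not.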

I could only reproduce this in GNU Emacs, not in XEmacs, but then, I
had no Gnus running under XEmacs, so I copied the letters over using
X11 C&P from a GNU Emacs to an XEmacs for the XEmacs test, and god
knows what side effects that had.


Second, this one:

http://user.cs.tu-berlin.de/~klischat/xemacs-utf8-manpage-fuckage.png

This happens in XEmacs only; GNU Emacs *displays* the manpage just
fine, but copying&pasting code over to an editor or an interactive
Perl interpreter and running it is still impossible; see below. Both
XEmacs and Emacs are running under the UTF-8 locale mentioned above.

When running under that locale, the "man" program (or is it nroff, or
troff, or groff?), for reasons that are beyond me, decides to turn the
perfectly valid ASCII character 0x27 ("'", U+0027 APOSTROPHE) into the
UTF-8 sequence 0xe2 0x80 0x99 [1], which, according to
http://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder,
is the character U+2019 RIGHT SINGLE QUOTATION MARK (similar things
happen with the "-" character, and probably others). This makes it
impossible to copy&paste code containing one of those characters into a
source code file and execute it. Furthermore, XEmacs apparently can't
even decode UTF-8 and display those characters properly; it only
displays the weird escape codes, as shown in my second picture
above. GNU Emacs displays the characters correctly, but still hands
those automagically converted UTF-8 byte sequences to the outside
world when copying&pasting. When copying said "weird escape codes"
from XEmacs to Emacs, they mysteriously show up in Emacs as "weird
escape codes" as well, not as the corresponding characters.
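
The decoding of that byte sequence can also be checked locally rather
than through the web decoder. A small sketch using Python's standard
unicodedata module (the variable names are mine):

```python
import unicodedata

# The three bytes that show up in place of the apostrophe.
seq = bytes([0xE2, 0x80, 0x99])
ch = seq.decode("utf-8")
name = unicodedata.name(ch)
print(f"U+{ord(ch):04X} {name}")

# For comparison, the ASCII character that was originally there.
apostrophe = bytes([0x27]).decode("ascii")
print(f"U+{ord(apostrophe):04X} {unicodedata.name(apostrophe)}")
```

This prints "U+2019 RIGHT SINGLE QUOTATION MARK" followed by
"U+0027 APOSTROPHE", i.e. two distinct characters.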

I don't know who is to blame for all this. Are those automatic
character conversions mandated by some standard? Is a programming
language interpreter/compiler supposed to treat U+2019 like a regular
"'" character?

All things considered, it seems that it is still quite impossible (or
should one say "adventurous"?) to use GNU and Emacs for programming
tasks under multibyte encodings.

Software used:

Linux 2.4.27
glibc 2.3.2.ds1-20
XEmacs 21.4 (patch 17)
GNU Emacs 21.3.1


[1]
address@hidden:~$ unset LANG
address@hidden:~$ man -P cat perlref | grep arrayref | grep '1, 2'
Reformatting perlref(1), please wait...
               $arrayref = [1, 2, ['a', 'b', 'c']];
address@hidden:~$ man -P cat perlref | grep arrayref | grep '1, 2' | myhexdump 
Reformatting perlref(1), please wait...
0000000  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 24                  $
0000010  61 72 72 61 79 72 65 66 20 3d 20 5b 31 2c 20 32   arrayref = [1, 2
0000020  2c 20 5b 27 61 27 2c 20 27 62 27 2c 20 27 63 27   , ['a', 'b', 'c'
0000030  5d 5d 3b 0a                                       ]];.
address@hidden:~$ 


vs.


address@hidden:~$ echo $LANG
en_US.UTF-8
address@hidden:~$ man -P cat perlref | grep arrayref | grep '1, 2'
Reformatting perlref(1), please wait...
               $arrayref = [1, 2, ['a', 'b', 'c']];
address@hidden:~$ man -P cat perlref | grep arrayref | grep '1, 2' | myhexdump 
Reformatting perlref(1), please wait...
0000000  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 24                  $
0000010  61 72 72 61 79 72 65 66 20 3d 20 5b 31 2c 20 32   arrayref = [1, 2
0000020  2c 20 5b e2 80 99 61 e2 80 99 2c 20 e2 80 99 62   , [...a..., ...b
0000030  e2 80 99 2c 20 e2 80 99 63 e2 80 99 5d 5d 3b 0a   ..., ...c...]];.
address@hidden:~$ 

(I can't reproduce this behaviour with iconv, btw)

(I had to manually edit out the U+2019 characters from the last
annotation so Gnus and XEmacs don't turn the whole article into a
multipart/mixed mess)
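
As a stopgap, text copied from such a manpage could be mapped back to
ASCII before it is pasted into source code. The following Python sketch
is only that, a workaround and not a fix for the underlying problem.
The function name and the mapping table are hypothetical; only the
U+2019 entry is confirmed by the hexdump above, the other entries are
my guesses at what the formatter might emit.

```python
# Hypothetical fallback table. Only U+2019 is confirmed by the dump;
# the remaining entries are assumptions.
ASCII_FALLBACK = {
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK (seen in the dump)
    0x2018: "`",   # LEFT SINGLE QUOTATION MARK (assumption)
    0x2010: "-",   # HYPHEN (assumption)
    0x2212: "-",   # MINUS SIGN (assumption)
}

def to_ascii_source(text):
    """Replace the typographic characters above with ASCII ones."""
    return text.translate(ASCII_FALLBACK)

# The line from the second hexdump, minus the leading spaces.
raw = (b"$arrayref = [1, 2, [\xe2\x80\x99a\xe2\x80\x99, "
       b"\xe2\x80\x99b\xe2\x80\x99, \xe2\x80\x99c\xe2\x80\x99]];\n")
fixed = to_ascii_source(raw.decode("utf-8"))
print(fixed, end="")
```

Characters not listed in the table pass through unchanged, so this is
lossy only for the entries it names.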



