emacs-devel

Re: Reporting UTF-8 related problems?


From: Karl Eichwalder
Subject: Re: Reporting UTF-8 related problems?
Date: Tue, 30 Jul 2002 20:58:32 +0200
User-agent: Gnus/5.090006 (Oort Gnus v0.06) Emacs/21.3.50 (i686-pc-linux-gnu)

Kenichi Handa <address@hidden> writes:

>       &#132;Die Familie Schroffenstein&#147
>
> I thought that the notation &#NUMBER is for transmitting
> Unicode character of code NUMBER.  But, 132 and 147 are
> control codes in Unicode, not any kind of quotings.

&#NUMBERs are so-called "character references"; the SGML declaration
defines which ones are allowed.  For HTML you must consult the
html.d[e]?cl file.  The crucial section is (HTML 2):

     BASESET   "ISO Registration Number 100//CHARSET
                ECMA-94 Right Part of
                Latin Alphabet Nr. 1//ESC 2/13 4/1"

         DESCSET  128  32   UNUSED
                  160  96    32

This basically means that &#128 to &#159 are unused.  The same applies
to HTML 4 (and later to XML and XHTML):

          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 [...]
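
As a rough illustration of how to read these rows: each one is a triple
of (first code, number of codes, first code in the base set or UNUSED).
The Python sketch below is not part of any SGML tool; the names
HTML4_DESCSET and lookup are made up for this example, and only the rows
quoted above are included (the elided "[...]" part is left out).

```python
# Each DESCSET row is (first code, count, meaning). A meaning of None
# stands for UNUSED; an int is the first matching code in the base set.
# Only the rows quoted in the mail are listed here.
HTML4_DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
]


def lookup(descset, code):
    """Return the base-set code for `code`, or None if it is UNUSED
    (or falls outside the rows listed in this sketch)."""
    for start, count, base in descset:
        if start <= code < start + count:
            return None if base is None else base + (code - start)
    return None


print(lookup(HTML4_DESCSET, 132))  # None: &#132 is in the 128..159 UNUSED range
print(lookup(HTML4_DESCSET, 147))  # None: same range
print(lookup(HTML4_DESCSET, 65))   # 65: plain ASCII maps to itself
```

With these rows, both 132 and 147 come back as unused, which is why a
strict parser rejects the references from the page.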

To make the SGML parser happy you can provide a changed declaration:

          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     4      UNUSED
                 132     1      "My rising double quote left (low)"
                 133     14     UNUSED
                 147     1      "My rising double quote right (high)"
                 148     16     UNUSED
                 [...]

Untested, and the result is invalid HTML.  If they announced a
proper HTTP header, it could be okay:

Content-Type: text/html; charset=windows-1252


Andreas Schwab <address@hidden> writes:

> The numbers are supposed to be ISO 8859-1 characters codes.  I'd guess the
> page has been written with some broken (a.k.a. W*nd*ws) software (the use
> of *.htm makes this apparent).

Yes, they have "interesting" guidelines online...

Kenichi Handa <address@hidden> writes:

> Ah, I see.  I found that windows-125X maps 132 and 147 to
> U+201E and U+201C.  So, perhaps those systems (galeon and
> lynx) parse them as U+201E and U+201C.  Anyway, how to
> encode them in X selection is their problem and Emacs can't
> do anything about it.

Yes, but once in the X selection I'd like to see Emacs honor them.
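
The mapping Kenichi mentions can be checked quickly outside Emacs.  This
is only a sketch using Python's standard library: "cp1252" is Python's
codec name for windows-1252, and html.unescape is documented to follow
the HTML 5 rules for invalid character references, which may or may not
match what galeon and lynx do internally.

```python
import html

# Decode the bytes 132 and 147 as windows-1252 ("cp1252" in Python).
low = bytes([132]).decode("cp1252")
high = bytes([147]).decode("cp1252")
print(f"U+{ord(low):04X} U+{ord(high):04X}")  # U+201E U+201C

# Resolve the references from the page, including the one that lacks
# its trailing semicolon, using the HTML 5 handling in html.unescape.
text = html.unescape("&#132;Die Familie Schroffenstein&#147")
print(text)
```

Both approaches yield the low and high German double quotes rather than
the C1 control codes.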

The spacing problem also occurs when I try to cut and paste from Markus
Kuhn's demo file
(http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt):

• ‚deutsche
„Anf
ührungszeichen
“

When I insert (C-x RET c utf-8 RET C-x C-f UTF-8-demo.txt RET), things
are correctly displayed (the characters are different):
• ‚deutsche‘ „Anf
ührungszeichen
“

Cutting and pasting both of these examples from Emacs (this mail buffer)
to a UTF-8 xterm doesn't work either; instead of the quotes I see "-1"
and garbage.

I hope the examples will go through.

-- 
address@hidden (work) / address@hidden (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)
