bug#12291: [rev 109796] wrong UTF-8 handling

From: Kenichi Handa
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Tue, 28 Aug 2012 23:57:39 +0900

In article <address@hidden>, Werner LEMBERG <address@hidden> writes:

> Have a look at the attached file, containing a single character.
> (It's transmitted as binary to avoid e-mail encoding issues).  It
> contains a single, four-byte UTF-8 encoded character (0xF4 0xB5 0x87
> 0x9E, which would map to the non-existent Unicode character code
> U+1351DE).  If I load this file as UTF-8 encoded, Emacs gives this as
> the output of `C-u C-x =':

>                position: 1 of 2 (0%), column: 0
>               character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
> Look what Emacs says about the file code.  If I save this
> one-character file as UTF-8, the character code stays as-is.

> This behaviour is clearly wrong.


> I suspect that Emacs is using such a
> high character code for internal representation of the `emacs-mule'
> encoding.  However, the user must not see this.  

That higher character code area is used for two purposes.

One is for reading CJK characters in legacy encodings (euc,
sjis, big5, etc).  They are decoded into the utf-8-emacs
byte sequence corresponding to the higher character code
area.  But when their character codes are retrieved, most of
them are unified into Unicode BMP characters.  A few are left
un-unified.  Those are the private characters of each legacy
character set.

Another is for supporting non-Unicode characters.  The
biggest set is GB18030.

In both cases, users surely see them.

> Instead, such characters must be converted to correct
> UTF-8.

??? I don't understand what you mean by "correct UTF-8".

I think the correct behaviour when reading such a file as
utf-8 is to treat each byte as a raw byte.
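
To make the arithmetic concrete, here is a minimal Python sketch.  It is
not Emacs code, and the helper name decode_4byte is made up for this
illustration.  It unpacks the four bytes from the report without any
range check, shows that the result lies above U+10FFFF, and then uses
Python's "surrogateescape" error handler as a rough analogue of keeping
each undecodable byte as a raw byte, so that saving the file gives back
the original bytes.

```python
UNICODE_MAX = 0x10FFFF  # highest code point defined by Unicode


def decode_4byte(seq: bytes) -> int:
    """Unpack a four-byte UTF-8-style sequence with no range check.

    Hypothetical helper for illustration only.
    """
    b0, b1, b2, b3 = seq
    if b0 & 0xF8 != 0xF0:
        raise ValueError("not a four-byte lead byte")
    if any(b & 0xC0 != 0x80 for b in (b1, b2, b3)):
        raise ValueError("bad continuation byte")
    return ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


data = bytes([0xF4, 0xB5, 0x87, 0x9E])  # the bytes from the report
code = decode_4byte(data)
beyond_unicode = code > UNICODE_MAX

# A strict UTF-8 decoder rejects the sequence outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Keep each undecodable byte as its own placeholder, then write it back.
raw = data.decode("utf-8", errors="surrogateescape")
round_trip = raw.encode("utf-8", errors="surrogateescape")

print(hex(code), beyond_unicode, strict_ok, len(raw), round_trip == data)
```

With this approach the text holds four separate byte placeholders rather
than one character, and writing it out reproduces the input unchanged.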

Kenichi Handa
