bug#12291: [rev 109796] wrong UTF-8 handling
From: Werner LEMBERG
Subject: bug#12291: [rev 109796] wrong UTF-8 handling
Date: Tue, 28 Aug 2012 07:47:20 +0200 (CEST)
[bzr revision 109796]
Have a look at the attached file. (It's transmitted as binary to avoid
e-mail encoding issues.) It contains a single four-byte UTF-8 encoded
character: 0xF4 0xB5 0x87 0x9E, which would map to the non-existent
Unicode code point U+1351DE (Unicode ends at U+10FFFF). If I load this
file as UTF-8 encoded, Emacs gives this as the output of
`C-u C-x =':
               position: 1 of 2 (0%), column: 0
              character: 二 (displayed as 二) (codepoint 20108, #o47214, #x4e8c)
      preferred charset: unicode (Unicode (ISO10646))
  code point in charset: 0x4E8C
                 syntax: w    which means: word
               category: .:Base, C:2-byte han, L:Left-to-right (strong),
                         c:Chinese, h:Korean, j:Japanese, |:line breakable
               to input: type "C-x 8 RET HEX-CODEPOINT" or "C-x 8 RET NAME"
            buffer code: #xE4 #xBA #x8C
              file code: #xE4 #xBA #x8C (encoded by coding system utf-8-unix)
                display: by this font (glyph code)
      xft:-unknown-SimSun-normal-normal-normal-*-24-*-*-*-d-0-iso10646-1 (#x460)

  Character code properties: customize what to show
    name: CJK IDEOGRAPH-4E8C
    general-category: Lo (Letter, Other)
    decomposition: (20108) ('二')
Look at what Emacs says about the file code. If I save this
one-character file as UTF-8, the character code stays as-is.
This behaviour is clearly wrong. I suspect that Emacs is using such a
high character code for internal representation of the `emacs-mule'
encoding. However, the user must not see this. Instead, such
characters must be converted to correct UTF-8.
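The byte arithmetic behind the report can be checked independently of
Emacs. The following is only a minimal sketch in Python, not anything
Emacs itself does: `decode_utf8_bits` and `strict_utf8_accepts` are
hypothetical helper names introduced for illustration. The first
extracts the payload bits of a UTF-8-style sequence without any range
checks; the second asks Python's strict UTF-8 decoder whether it accepts
the bytes.

```python
UNICODE_MAX = 0x10FFFF  # highest code point defined by Unicode


def decode_utf8_bits(seq: bytes) -> int:
    """Extract the payload bits of one UTF-8-style sequence (no range checks)."""
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif lead >> 5 == 0b110:
        length, value = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:
        length, value = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:
        length, value = 4, lead & 0x07
    else:
        raise ValueError("not a UTF-8 lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("not a continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


def strict_utf8_accepts(seq: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


file_bytes = bytes([0xF4, 0xB5, 0x87, 0x9E])  # bytes quoted in the report
shown_bytes = bytes([0xE4, 0xBA, 0x8C])       # "file code" shown by Emacs

reported = decode_utf8_bits(file_bytes)
shown = decode_utf8_bits(shown_bytes)

print(hex(reported), reported > UNICODE_MAX)  # 0x1351de True
print(hex(shown))                             # 0x4e8c
print(strict_utf8_accepts(file_bytes))        # False
print(strict_utf8_accepts(shown_bytes))       # True
```

Under these assumptions the four bytes carry the value 0x1351DE, which
lies above U+10FFFF, so a strict decoder rejects them, while the three
bytes Emacs reports as the file code are the ordinary UTF-8 encoding of
U+4E8C.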
Werner
======================================================================
In GNU Emacs 24.2.50.1 (i686-pc-linux-gnu, GTK+ Version 2.24.9)
of 2012-08-28 on linux-nvf0
Windowing system distributor `The X.Org Foundation', version 11.0.11004000
Configured using:
`configure 'MAKEINFO=/usr/bin/makeinfo' '--with-x-toolkit=gtk''
Important settings:
value of $LANG: de_DE.UTF-8
value of $XMODIFIERS: @im=none
locale-coding-system: utf-8-unix
default enable-multibyte-characters: t
Major mode: Summary
Minor modes in effect:
tooltip-mode: t
mouse-wheel-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
blink-cursor-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
column-number-mode: t
transient-mark-mode: t
Recent input:
<return> w b u g - e m <tab> <tab> <tab> <tab> <tab>
<tab> <tab> <backspace> <backspace> <tab> <tab> C-c
C-q y M-x w r i t e - e m <tab> C-g C-h a b u g <return>
<M-next> C-x 1 M-x r e p r t <backspace> <backspace>
o r t - e m <tab> <return>
Recent messages:
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft is prepared
No matching alias [7 times]
Kill draft message? (y or n) y
Saving file /home/wl/Mail/draft/11...
Wrote /home/wl/Mail/draft/11
Draft was killed
Quit
Type C-x 4 C-o RET to restore the other window.
Load-path shadows:
None found.
Features:
(shadow emacsbug message format-spec rfc822 mml mml-sec mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
sendmail rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils
apropos descr-text latexenc preview prv-emacs byte-opt tex-buf
noutline outline font-latex warnings bytecomp byte-compile cconv
macroexp latex easy-mmode edmacro kmacro tex-style cus-edit wid-edit
cus-start cus-load pp mew-varsx mew-unix cal-menu calendar
cal-loaddefs mew-auth mew-config mew-imap2 mew-imap mew-nntp2 mew-nntp
mew-pop mew-smtp mew-ssl mew-ssh mew-net mew-highlight mew-sort
mew-fib mew-ext mew-refile mew-demo mew-attach mew-draft mew-message
mew-thread mew-virtual mew-summary4 mew-summary3 mew-summary2
mew-summary mew-search mew-pick mew-passwd mew-scan mew-syntax mew-bq
mew-smime mew-pgp mew-header mew-exec mew-mark mew-mime mew-edit
mew-decode mew-encode mew-cache mew-minibuf mew-complete mew-addrbook
mew-local mew-vars3 mew-vars2 mew-vars mew-env mew-mule3 mew-mule
mew-gemacs mew-key mew-func mew-blvs mew-const mew tex advice help-fns
advice-preload tex-site auto-loads quail help-mode easymenu cjktilde
disp-table time-date tooltip ediff-hook vc-hooks lisp-float-type
mwheel x-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment lisp-mode register page menu-bar rfn-eshadow
timer select scroll-bar mouse jit-lock font-lock syntax facemenu
font-core frame cham georgian utf-8-lang misc-lang vietnamese tibetan
thai tai-viet lao korean japanese hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer loaddefs button faces
cus-face files text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget hashtable-print-readable backquote
make-network-process dbusbind dynamic-setting system-font-setting
font-render-setting move-toolbar gtk x-toolkit x multi-tty emacs)
emacs-problem.utf8
Description: Binary data