From: Mingye Wang (Arthur2e5)
Subject: [bug-gnu-libiconv] cp936, cp950, cp1252, etc. does not behave like their windows counterparts
Date: Wed, 23 Nov 2016 21:03:56 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.7.2

Hi,

It seems to me that the implementations of a few Windows code pages in libiconv do not behave like their Windows counterparts.

For clarity, I am using the $'ansi-c-escape' literal in bash, with my console set to UTF-8. `iconv --version' returns iconv (Ubuntu GLIBC 2.23-0ubuntu4) 2.23.

Missing euro sign in cp936
--------------------------

The single-byte euro sign at 0x80 is probably the best-known modification that Microsoft has made to GBK. However, it is not present in libiconv:

$ iconv -f cp936 -t utf-8 <<< $'\x80'
iconv: illegal input sequence at position 0
$ iconv -t cp936 -f utf-8 <<< $'\u20ac' | hexdump -C
iconv: illegal input sequence at position 0
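
To make the expected mapping concrete, here is a minimal Python sketch of the Windows-style behavior described above. It is not libiconv's code or Microsoft's table: the function names are hypothetical, and Python's built-in "gbk" codec is assumed as a stand-in for the double-byte part of cp936.

```python
def decode_cp936_windows_style(data: bytes) -> str:
    """Decode bytes as GBK, but map a lone 0x80 to U+20AC (assumed Windows behavior)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))        # ASCII passes through unchanged
            i += 1
        elif b == 0x80:
            out.append("\u20ac")      # the single-byte euro sign
            i += 1
        else:
            # Lead byte of a double-byte character. 0x80 is also a legal
            # trail byte in GBK, so the input is walked character by
            # character instead of doing a blind byte replacement.
            out.append(data[i:i + 2].decode("gbk"))
            i += 2
    return "".join(out)


def encode_cp936_windows_style(text: str) -> bytes:
    """Encode to GBK, emitting the single byte 0x80 for U+20AC."""
    return b"".join(
        b"\x80" if ch == "\u20ac" else ch.encode("gbk") for ch in text
    )
```

With this sketch, `decode_cp936_windows_style(b"\x80")` yields the euro sign and `encode_cp936_windows_style("\u20ac")` yields the single byte 0x80, which is the round trip the two commands above fail to perform.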


cp950 has no mappings for HKSCS
-------------------------------

Microsoft has released some updates to code page 950 so that it includes HKSCS. Among these updates is the well-known "cp951" hack.[1]
  [1]: https://blogs.msdn.microsoft.com/shawnste/2007/03/12/cp-951-hkscs/

But well, iconv's cp950 does not even know the first Big5-EUDC character[2] in HKSCS:

$ iconv -f cp950 -t utf-8 <<< $'\x87\x40'
iconv: illegal input sequence at position 0

[2]: http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/hkscs-2008-big5-iso.txt

The reverse does not work either:

$ iconv -t cp936 -f utf-8 <<< $'\u43f0' | hexdump -C
iconv: illegal input sequence at position 0
$ iconv -t cp936 -f utf-8 <<< $'\uf266' | hexdump -C
iconv: illegal input sequence at position 0

... where the latter is one of these sequential PUA assignments for Big5-EUDC seen in MS's best-fit chart.[3] Since it's a bidirectional conversion, this assignment is not part of "best fit" behavior per [4].

[3]: ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
[4]: ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
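
As a rough illustration of a cp950 decoder that also accepts HKSCS code points, the following Python sketch tries the plain cp950 table first and falls back to an HKSCS table for pairs that cp950 rejects. The function name is hypothetical, and Python's "cp950" and "big5hkscs" codecs are assumptions standing in for the real tables: they may not match Microsoft's updated code page or the HKSCS-2008 chart cited above, and the PUA assignments from the best-fit chart are not modeled.

```python
def decode_cp950_with_hkscs(data: bytes) -> str:
    """Decode Big5 bytes with cp950, falling back to HKSCS for unknown pairs."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            out.append(chr(data[i]))  # ASCII passes through unchanged
            i += 1
            continue
        pair = data[i:i + 2]          # lead byte plus trail byte
        try:
            out.append(pair.decode("cp950"))
        except UnicodeDecodeError:
            # Not in the base table: try the HKSCS extension instead.
            # This still raises if neither table knows the pair.
            out.append(pair.decode("big5hkscs"))
        i += 2
    return "".join(out)
```

Ordinary Big5 text goes through the first branch untouched; only pairs that cp950 reports as illegal reach the HKSCS table.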

0x81 and 0x8d for cp1252, etc.
------------------------------

Windows' single-byte code pages map any byte that has no other assignment to the same code point as in Latin-1 (including the C0 and C1 controls), in both directions. libiconv does not exhibit this behavior for cp1250, cp1252, etc.

$ iconv -f cp1252 -t utf-8 <<< $'\x81'
iconv: illegal input sequence at position 0
$ iconv -f cp1252 -t utf-8 <<< $'\x8d'
iconv: illegal input sequence at position 0
$ iconv -f cp1250 -t utf-8 <<< $'\x81'
iconv: illegal input sequence at position 0
$ iconv -t cp1252 -f utf-8 <<< $'\u0081' | hexdump -C
iconv: illegal input sequence at position 0
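
The Latin-1 fallback described in this section can be sketched in Python as follows. The function names are hypothetical, and Python's own "cp1252" and "cp1250" codecs are assumed as the strict base tables (like libiconv, they reject the undefined bytes). The byte-by-byte loop is only valid for single-byte code pages.

```python
def _is_undefined(byte: int, codec: str) -> bool:
    """Return True if the strict codec has no mapping for this byte."""
    try:
        bytes([byte]).decode(codec)
    except UnicodeDecodeError:
        return True
    return False


def decode_windows_style(data: bytes, codec: str = "cp1252") -> str:
    """Decode a single-byte code page, mapping undefined bytes like Latin-1."""
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            out.append(chr(b))        # e.g. 0x81 -> U+0081
    return "".join(out)


def encode_windows_style(text: str, codec: str = "cp1252") -> bytes:
    """Encode to a single-byte code page with the matching reverse fallback."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            # Fall back only when the byte of the same value is unassigned,
            # so that the conversion stays bidirectional.
            if ord(ch) <= 0xFF and _is_undefined(ord(ch), codec):
                out.append(ord(ch))
            else:
                raise
    return bytes(out)
```

Under these assumptions, `decode_windows_style(b"\x81")` gives U+0081 and `encode_windows_style("\u0081")` gives the byte 0x81, while assigned bytes such as 0x80 (the euro sign in cp1252) keep their normal mapping.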


--
Regards,

Arthur2e5


