bug-gnu-libiconv

Re: [bug-gnu-libiconv] Updating iconv tables


From: Jim Breen
Subject: Re: [bug-gnu-libiconv] Updating iconv tables
Date: Thu, 12 Jun 2008 11:34:40 +1000

Hi Bruno,

Great to hear from you.

2008/6/12 Bruno Haible <address@hidden>:

> I'm not sure I understand it all right.
>
>> When people have
>> gone to convert the EDICT file to UTF8 for other
>> systems, the iconv utility simply dies on that character
>
> In summary, you are saying that you have a particular character in EUC-JP,
> that the iconv conversion from EUC-JP to UTF-8 does not grok?
>
> Then the character is not EUC-JP.

Wrong. I'll explain more below.

> I'm not sure which character you are talking about, because your mail
> had an encoding specification of ISO-2022-JP, which usually means
> ISO-2022-JP-2, but that particular character was invalid in ISO-2022-JP-2
> (it was encoded as "ESC $ B - j"), the other character in that line was
> U+682A, and you were talking about U+3231.

This is a bit of a side issue. My email was indeed in ISO-2022-JP, since
I have gmail set to use the default for the language, and my email
contained Japanese. The code point in question converts and displays
correctly in compliant mailers. Nothing illegal about it.
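
For reference, the byte pair Bruno quotes can be turned into a ku-ten (row-cell) position with simple arithmetic. The following is a minimal sketch, assuming the two bytes after "ESC $ B" come from a 94x94 double-byte set; the function name is made up for illustration and is not part of iconv.

```python
from typing import Tuple


def iso2022_pair_to_kuten(pair: bytes) -> Tuple[int, int]:
    """Map a 7-bit double-byte pair (as sent after ESC $ B) to (ku, ten)."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected two bytes in the range 0x21..0x7E")
    # Each byte is the row or cell number offset by 0x20.
    return pair[0] - 0x20, pair[1] - 0x20


# The pair "- j" from the quoted escape sequence.
print(iso2022_pair_to_kuten(b"-j"))  # (13, 74)
```

Row 13 is unassigned in JIS X 0208 itself, which is why a converter built only on that standard rejects the pair, while JIS X 0213 assigns characters there.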

>> The problem, I conclude, is with the compiled-in tables
>> in iconv in the Linux distros. It seems Sun has gone to
>> the trouble of keeping theirs up-to-date, but the standard
>> distros haven't.
>
> You have a misconception of what EUC-JP is. EUC-JP is a character encoding
> scheme based on three standards: ASCII, JIS X 0208, and JIS X 0212. These
> are standards issued by Japanese authorities, and carved in stone. Anyone
> who thinks that EUC-JP tables have to be "kept up-to-date", is asking for
> deviation from standards, and is asking for interoperability problems!

You are out-of-date there. EUC-JP also includes JIS X 0213, which was released
in 2000 and updated in 2004. The codepoint I raised arrived in JIS X 0213. You
can think of JIS X 0213 as an enhancement/replacement for JIS X 0208. It added
a heap of additional characters, *all* of which have been included in Unicode,
and all of which have EUC codings, since EUC-JP is simply a transformation
of the ku-ten codes in the Japanese standards. Of course EUC-JP tables need to
be kept up-to-date.

See: http://en.wikipedia.org/wiki/JIS_X_0213 for an overview.
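
To illustrate what "a transformation of the ku-ten codes" means, here is a rough sketch that builds the two-byte EUC form of a plane 1 position and then asks Python's own codecs whether they accept it. The helper names are hypothetical, the position 13-74 is the one derived from the escape sequence quoted earlier in this message, and the result depends on the tables compiled into your Python, not on iconv.

```python
from typing import Dict, Iterable, Optional


def kuten_to_euc(ku: int, ten: int) -> bytes:
    """Encode a plane 1 ku-ten position as two EUC bytes (0xA0 + value)."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return bytes([0xA0 + ku, 0xA0 + ten])


def try_decode(data: bytes, codec_names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Decode data with each codec; store None where the codec rejects it."""
    results: Dict[str, Optional[str]] = {}
    for name in codec_names:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


euc = kuten_to_euc(13, 74)
print(euc.hex())  # adea

# Compare a JIS X 0208/0212-based codec with a JIS X 0213-based one.
for name, text in try_decode(euc, ["euc_jp", "euc_jisx0213"]).items():
    shown = "undecodable" if text is None else " ".join(f"U+{ord(c):04X}" for c in text)
    print(name, "->", shown)
```

This covers only the two-byte form; positions outside plane 1 are prefixed with the single-shift byte 0x8F and are not handled here.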

> The interoperability problem that you encountered is *precisely* due to
> your vendor having added "extensions" to their EUC-JP fonts, and you
> expect that everyone else has the same extensions in their fonts and tables!
> Take a look at
>   http://www.haible.de/bruno/charsets/conversion-tables/EUC-JP.html
> to see how many variants of EUC-JP already exist!

Sadly your WWW page omits any mention of JIS X 0213. In other words it is
lacking all the characters added to the standard Japanese codings in the last
decade. Sun has simply kept up with the developments in Japanese coding.
These are *not* vendor extensions.

In case you think I am talking through my hat, I must point out that I am
one of only a handful of non-Japanese people who have participated in the
development of the Japanese standards. You will find my name among the
respondents at the back of JIS X 0208-1997, along with people like Ken Lunde
and Martin Duerst. (I assume you have a copy.) Ask Ken if he has heard of me.

I am happy to work with you in getting the full set of current Japanese
codes into iconv. As it stands at the moment, the GNU issue does not
adequately handle all the standard Japanese codes.

Best wishes,

Jim

-- 
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/



