gnu-arch-users


From: Miles Bader
Subject: [Gnu-arch-users] Re: [semi-OT] Unicode / han unification (was Re: Spaces ...)
Date: 22 Jan 2004 11:19:19 +0900

Tom Lord <address@hidden> writes:
> My personal opinion is that the Unicode consortium is probably right.
> While I can't personally evaluate the CJK issue based on my own
> knowledge, in those areas (both linguistic and computational) where I
> _am_ qualified to judge their arguments and decisions -- they are
> unfailingly wise.

It's at best rumour, but I have heard that a major impetus behind han
unification was to save space (in a 16-bit encoding), not `correctness'.

My personal test is the `README test':  I'd like `cat README' to always
yield something appropriate even on a dumb terminal -- even if the README
file is part of a Chinese package, and I'm reading it on my American
computer (say at a university where the computer systems have to cater to a
very diverse audience).

As far as I know, basic Unicode doesn't do this correctly for CJK, though
it apparently does for other character sets.

The problem, as I understand it, is that although all these characters have
a shared history, and in many cases are in fact exactly the same character
(with maybe a very slight difference in detailing), some have diverged
quite a bit in appearance, to the point where they are unrecognizable if
displayed in the wrong `font'.  In such a case are they the same character?
I dunno; for some usages that makes a lot of sense, for others, it doesn't.
I'll bet they could have done quite nicely with a sort of 90% unification:
unify everything that looks pretty much the same (a lot), and keep separate
code-points for stuff that has changed dramatically.  You'd still get
complaints of course, but at least a baseline of `always readable' would be
better met.
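To make the unification concrete, here is a minimal sketch in Python.  It
assumes the `shift_jis', `gb2312' and `big5' codecs that ship with CPython,
and the two sample characters are my own picks (commonly cited cases where
the usual Japanese and Chinese glyph shapes differ); neither the helper name
nor the samples come from the Unicode standard itself.  Each national
encoding has its own byte sequence for the character, but all of them decode
to one and the same Unicode code point, so the text alone no longer says
which regional shape was intended.

```python
# Characters whose usual Japanese and Chinese glyph shapes differ visibly,
# yet which Unicode assigns a single code point each (illustrative picks).
SAMPLES = ["\u76f4", "\u9aa8"]

# Legacy national encodings, all available as standard CPython codecs.
LEGACY = ["shift_jis", "gb2312", "big5"]


def unified_code_points(ch):
    """Round-trip ch through each legacy encoding and collect the
    Unicode code points that come back (hypothetical helper)."""
    return {ord(ch.encode(enc).decode(enc)) for enc in LEGACY}


for ch in SAMPLES:
    # Show the per-encoding bytes next to the single shared code point.
    legacy_bytes = {enc: ch.encode(enc).hex() for enc in LEGACY}
    print("U+%04X" % ord(ch), legacy_bytes, unified_code_points(ch))
```

Whether a terminal then draws the Japanese or the Chinese shape depends
entirely on the font it happens to pick, which is exactly where the `README
test' can fail.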

I seem to recall that there are somewhat kludgey additions to make things
work; I forget the specifics, but you can basically embed a sort of string
specifying the name of the font or locale or whatever (using appropriate
weird high bits in the characters of the name).  I don't know how widely
this is supported by Unicode applications though.
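The mechanism half-remembered above is probably the Plane 14 language tag
characters: U+E0001 LANGUAGE TAG followed by `tag' copies of printable ASCII
at U+E0020..U+E007E.  That identification is an assumption on my part, and
the function names below are hypothetical, but the code points themselves
are real.  A rough sketch of how such a tag would be built and read back:

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000           # tag characters mirror ASCII at this offset


def make_language_tag(lang):
    """Encode an ASCII language name such as 'ja' as a tag sequence."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language name must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in lang)


def read_language_tag(text):
    """Return (lang, rest) if text starts with a language tag,
    otherwise (None, text)."""
    if not text.startswith(LANGUAGE_TAG):
        return None, text
    i = 1
    chars = []
    # Consume tag characters and map them back to plain ASCII.
    while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007E:
        chars.append(chr(ord(text[i]) - TAG_BASE))
        i += 1
    return "".join(chars), text[i:]


tagged = make_language_tag("ja") + "\u76f4"
print(read_language_tag(tagged))
```

The Unicode Standard has since deprecated these characters for language
tagging and recommends higher-level markup instead, so a dumb terminal is
unlikely to act on them.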

-Miles
-- 
Saa, shall we dance?  (from a dance-class advertisement)



