Re: [Gnu-arch-users] [OT] Unicode vs. legacy character sets


From: Stephen J. Turnbull
Subject: Re: [Gnu-arch-users] [OT] Unicode vs. legacy character sets
Date: Tue, 02 Mar 2004 13:21:45 +0900
User-agent: Gnus/5.1006 (Gnus v5.10.6) XEmacs/21.5 (celeriac, linux)

>>>>> "Tom" == Tom Lord <address@hidden> writes:

    Tom> In principle, Unicode could be upwardly compatibly extended
    Tom> in ways that "undo" the unification.

No, it can't.  The Unicode _standard_ could be extended.  But private
extensions are not Unicode, and it's back to the Tower of Babel.

    Tom> I'm willing to have libhackerlab (hence Pika and arch) use an
    Tom> _extended_ Unicode.  Standardizing, within those libraries
    Tom> and programs on assigning-by-convention some private-use
    Tom> codepoints to un-unified characters.

Don't.  Start by finding out what you can already do within Unicode.

(1) There is already standardization work going on in the ISO 10646
working group (the ISO group that certifies characters) on adding tens
of thousands of Han characters as an "extended ideographic plane".  I
haven't been
following the discussion closely, but as of a couple of years ago
there was a strong contingent in favor of doing a fair amount of
"de-unification".

(2) You can already portably de-unify Han characters by use of Plane
14 language tags.  This has the advantage that "dumb" Unicode apps
_must_ ignore the tags (since they will be encoded as surrogate
pairs in UTF-16), and thus will get at least the minimal sanity
imposed by Unihan.  If you use a private block extension, then only
apps using your extension can do _anything_ except test for equality
of characters.  Yecch.  NB: because of the round-tripping rule for
national standards, Plane 14 tags are sufficient to uniquely
disambiguate _all_ standardized Han characters.
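
To make the tag mechanism in (2) concrete, here is a minimal sketch in
Python.  It assumes the usual Plane 14 layout: U+E0001 LANGUAGE TAG
followed by tag characters that mirror printable ASCII at an offset of
U+E0000.  The helper names and the sample character U+76F4 are
illustrative choices only, not anything from libhackerlab or arch.  Note
also that later versions of the Unicode Standard strongly discourage
this mechanism in favor of higher-level markup, so treat it as a
demonstration rather than a recommendation.

```python
LANGUAGE_TAG = 0xE0001   # U+E0001 LANGUAGE TAG, introduces a language tag
TAG_BASE = 0xE0000       # tag character = TAG_BASE + ASCII code point
TAG_BLOCK = range(0xE0000, 0xE0080)  # the Plane 14 "Tags" block


def make_language_tag(lang):
    """Spell a language code such as 'ja' in Plane 14 tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language code must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in lang)


def strip_tags(text):
    """What a tag-unaware application does: drop every tag character."""
    return "".join(c for c in text if ord(c) not in TAG_BLOCK)


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    han = "\u76f4"                        # a unified Han character (sample only)
    tagged = make_language_tag("ja") + han
    print([hex(u) for u in utf16_units(tagged)])
    print(strip_tags(tagged) == han)      # the tagged text degrades to plain Unihan
```

The tagged and untagged strings differ only in characters a tag-unaware
program can drop, which is the point of the comparison with a private
block extension: the Han character itself keeps its standard code
point.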

If you want to go beyond the strictures of Unicode, you might want to
look at the UTF-2000 Project (I think they now call themselves CHISE,
but that might be a separate project): http://www.m17n.org/utf-2000/
and http://www.kanji.zinbun.kyoto-u.ac.jp/projects/chise/xemacs/ (I
think this is all Japanese, sorry).  Despite the "xemacs" in that
URL, CHISE actually is a general character database and library, and
IIANM there are perl and ruby versions.  There is also a new
multilingualization library released just a few days ago,
http://www.m17n.org/lib-m17n/ (at least some "English" text).

However, even the CHISE people will say that their databases are for
historical linguistic work, not for modern use, except by people who
use variant glyphs for the same reasons teenage girls dot their i's
with hearts.

    Tom> Beyond that -- it could provide a practical demonstration (or
    Tom> refutation) of the benefits of undoing the unification in
    Tom> Unicode.

Nope.  "De gustibus non est disputandum."  You may convince the
unconvinced, but you won't change the minds of the convinced.


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.



