Re: [Gnu-arch-users] [semi-OT] Unicode / han unification (was Re: Spaces

From: Tom Lord
Subject: Re: [Gnu-arch-users] [semi-OT] Unicode / han unification (was Re: Spaces ...)
Date: Wed, 21 Jan 2004 17:56:25 -0800 (PST)

    > From: David Brown <address@hidden>

    > Korean is a bit more "annoying", since Unicode provides several
    > different ways to encode a single glyph.  There are two encodings in
    > Unicode that take several Unicode code points and map to a single glyph.
    > So, for example, 'Han' could be three code points, representing 'H',
    > then 'A', then 'N'.  There are two different encodings just for this.
    > Then, given a complex set of rules, this can be boiled down to a single
    > glyph for the syllable 'Han'.  There is also a codepoint just for the
    > symbol 'Han'.

    > All this means is that, especially for Korean, determining whether two
    > strings are equal is quite complex.

You are talking about "canonical combining forms" (canonical
equivalence between composed and decomposed sequences) and other
canonicalization issues, yes?

Those issues are not at all unique to CJK; accented Latin letters
raise the same questions.  And yes, they are complex enough to require
considerable care to make them manageable, but I don't see any way to
make things simpler than Unicode already does, do you?  The Unicode
consortium has at least provided some strong hints about how to do it,
in the form of its normalization forms (NFC, NFD, NFKC, NFKD).
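
To make the equality problem concrete, here is a minimal sketch in
Python, assuming the standard-library unicodedata module and assuming
that "equal" means canonically equivalent (compared after NFC
normalization).  The helper name canonically_equal is made up for
illustration.  It covers only the precomposed syllable and the
conjoining-jamo sequence for 'Han', plus one Latin example; it does not
handle the compatibility jamo block.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper: one possible definition of "equal", not the
    only one (NFD would work as well, as long as both sides agree).
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The syllable 'Han' as a single precomposed code point (U+D55C).
han_precomposed = "\uD55C"

# The same syllable as three conjoining jamo:
# U+1112 (leading H), U+1161 (vowel A), U+11AB (trailing N).
han_jamo = "\u1112\u1161\u11AB"

# The same issue outside CJK: e-acute as one code point (U+00E9)
# versus 'e' followed by a combining acute accent (U+0301).
e_precomposed = "\u00E9"
e_combining = "e\u0301"

# A naive code-point comparison treats each pair as different strings.
naive_han = han_precomposed == han_jamo
naive_e = e_precomposed == e_combining

# Comparison after normalization treats each pair as the same text.
normalized_han = canonically_equal(han_precomposed, han_jamo)
normalized_e = canonically_equal(e_precomposed, e_combining)
```

Here naive_han and naive_e are both False, while normalized_han and
normalized_e are both True.  Normalizing once on input and storing only
one form is a common alternative to normalizing at every comparison.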


