Re: [GNUnet-developers] encoding: normalization [Was: Re: Music insertion]

From: Alexander Winston
Subject: Re: [GNUnet-developers] encoding: normalization [Was: Re: Music insertion]
Date: Sun, 05 Dec 2004 16:15:29 -0500

On Sun, 2004-12-05 at 15:18 -0500, Christian Grothoff wrote:
> On Saturday 04 December 2004 17:20, Alexander Winston wrote:
> > Unicode provides 4 normalization forms:
> >
> > * Normalization Form D (NFD)
> > * Normalization Form C (NFC)
> > * Normalization Form KD (NFKD)
> > * Normalization Form KC (NFKC)
> >
> > Given the nature of GNUnet, I suggest normalizing all the proposed
> > keywords using NFC and NFKC, removing the duplicate keywords, and then
> > adding the remaining keywords.
> >
> > I still have little experience with normalization, however, so please
> > take this advice with a grain of salt.
> Right.  Even if we use UTF-8, we still have to think about normalization.
> And I believe this issue fully applies to UTF-8 (after all, UTF-8 is just a
> Unicode encoding).  Actually, it might be worse: if I recall correctly there
> are different UTF-8 encodings for some Unicode characters, so we have the
> normalization issue for Unicode *and* for UTF-8.  So if anyone has any
> experience here, please speak up.  I was thinking of using libiconv to
> convert to UTF-8.  Will this produce a canonical representation?  Which one?
> If not, is there some free code available that will do the canonicalization?

Yes, for example, consider U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
In UTF-8, this has the byte sequence 0xC3 0xBC. The canonical
decomposition for this character is U+0075 LATIN SMALL LETTER U + U+0308
COMBINING DIAERESIS. In UTF-8, this has the byte sequence 0x75 0xCC 0x88.

Using NFC on U+00FC would get you U+00FC. Using NFD on U+00FC would get
you U+0075 + U+0308.

Using NFC on U+0075 + U+0308 would get you U+00FC. Using NFD on U+0075 +
U+0308 would get you U+0075 + U+0308.
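
To make the above concrete, here is a small sketch of the same checks. It
uses Python's standard unicodedata module purely for illustration (GNUnet
itself is C); the helper name utf8_hex is my own invention, not anything
from GNUnet or libiconv.

```python
import unicodedata

composed = "\u00fc"      # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # U+0075 followed by U+0308 COMBINING DIAERESIS


def utf8_hex(s):
    """Return the UTF-8 bytes of s as a list of hex strings (illustrative helper)."""
    return ["0x%02X" % b for b in s.encode("utf-8")]


# Two different byte sequences for what a reader sees as the same character.
print(utf8_hex(composed))    # ['0xC3', '0xBC']
print(utf8_hex(decomposed))  # ['0x75', '0xCC', '0x88']

# NFC maps both spellings to the precomposed form.
print(unicodedata.normalize("NFC", composed) == composed)      # True
print(unicodedata.normalize("NFC", decomposed) == composed)    # True

# NFD maps both spellings to the decomposed form.
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
print(unicodedata.normalize("NFD", decomposed) == decomposed)  # True
```

Both byte sequences are valid UTF-8. The difference is at the level of
Unicode code points, not of the UTF-8 encoding.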

To my understanding, libiconv only converts between character encodings
and does not itself normalize, so its UTF-8 output is no more canonical
than its input. GLib, however, does have code for Unicode normalization
(g_utf8_normalize()).
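
As for the keyword handling suggested in the quoted text above (normalize
each proposed keyword with NFC and NFKC, drop the duplicates, insert what
remains), a rough sketch might look like the following. This is again
Python for illustration only; the function name normalized_keywords and
the sample keywords are hypothetical and not part of GNUnet.

```python
import unicodedata


def normalized_keywords(keywords):
    """Return the NFC and NFKC forms of each keyword, without duplicates.

    The order of first appearance is preserved. (Hypothetical helper.)
    """
    seen = set()
    result = []
    for keyword in keywords:
        for form in ("NFC", "NFKC"):
            normalized = unicodedata.normalize(form, keyword)
            if normalized not in seen:
                seen.add(normalized)
                result.append(normalized)
    return result


# Sample input: decomposed and precomposed u-diaeresis, plus a keyword
# starting with U+FB01 LATIN SMALL LIGATURE FI.
proposed = ["u\u0308ber", "\u00fcber", "\ufb01le"]
print(normalized_keywords(proposed))
```

The first two keywords collapse into one precomposed spelling. The third
is kept as-is under NFC and also added as plain "file" under NFKC, since
only the compatibility forms fold the ligature.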

Hope this helps.

