Re: [GNUnet-developers] encoding: normalization [Was: Re: Music insertion]

From: Alexander Winston
Subject: Re: [GNUnet-developers] encoding: normalization [Was: Re: Music insertion]
Date: Sun, 05 Dec 2004 16:15:29 -0500

On Sun, 2004-12-05 at 15:18 -0500, Christian Grothoff wrote:
> On Saturday 04 December 2004 17:20, Alexander Winston wrote:
> > Unicode provides 4 normalization forms:
> >
> > * Normalization Form D (NFD)
> > * Normalization Form C (NFC)
> > * Normalization Form KD (NFKD)
> > * Normalization Form KC (NFKC)
> >
> > Given the nature of GNUnet, I suggest normalizing all the proposed
> > keywords using NFC and NFKC, removing the duplicate keywords, and then
> > adding the remaining keywords.
> >
> > I still have little experience with normalization, however, so please
> > take this advice with a grain of salt.
> Right.  Even if we use UTF-8, we still have to think about normalization.
> And I believe this issue fully applies to UTF-8 (after all, UTF-8 is just a
> Unicode encoding).  Actually, it might be worse: if I recall correctly there
> are different UTF-8 encodings for some Unicode characters, so we have the
> normalization issue for Unicode *and* for UTF-8.  So if anyone has any
> experience here, please speak up.  I was thinking of using libiconv to
> convert to UTF-8.  Will this produce a canonical representation?  Which one?
> If not, is there some free code available that will do the canonicalization?

Yes, for example, consider U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
In UTF-8, this has the byte sequence 0xC3 0xBC. The canonical
decomposition for this character is U+0075 LATIN SMALL LETTER U + U+0308
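COMBINING DIAERESIS. In UTF-8, this has the byte sequence 0x75 0xCC
0x88. So the same visible character can arrive as two different byte
strings, even though each individual code point has exactly one valid
UTF-8 encoding.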

Using NFC on U+00FC would get you U+00FC. Using NFD on U+00FC would get
you U+0075 + U+0308.

Using NFC on U+0075 + U+0308 would get you U+00FC. Using NFD on U+0075 +
U+0308 would get you U+0075 + U+0308.
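
To make the four cases above concrete, here is a small sketch in Python
using the standard unicodedata module. It is only an illustration, not
GNUnet code; the helper name utf8_hex is made up for this example.

```python
import unicodedata

# The two canonically equivalent spellings discussed above.
composed = "\u00fc"      # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # U+0075 + U+0308 COMBINING DIAERESIS


def utf8_hex(text):
    """Hypothetical helper: show the UTF-8 bytes of text as 0xNN tokens."""
    return " ".join(f"0x{byte:02X}" for byte in text.encode("utf-8"))


# NFC composes, NFD decomposes; both are idempotent on their own output.
nfc_results = [unicodedata.normalize("NFC", s) for s in (composed, decomposed)]
nfd_results = [unicodedata.normalize("NFD", s) for s in (composed, decomposed)]

print(utf8_hex(composed))    # bytes of the precomposed form
print(utf8_hex(decomposed))  # bytes of the decomposed form
print(nfc_results[0] == nfc_results[1])  # both inputs agree after NFC
print(nfd_results[0] == nfd_results[1])  # both inputs agree after NFD
```

Whichever form is picked, the point is that both inputs end up as the
same byte string once normalized.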

As far as I can tell, libiconv only converts between encodings and does
not perform Unicode normalization on its own. GLib does have code for
it, though: g_utf8_normalize(), which takes a mode such as
G_NORMALIZE_NFC or G_NORMALIZE_NFKC.
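
For the keyword suggestion quoted at the top of this message (normalize
each proposed keyword with NFC and NFKC, drop the duplicates, add what
remains), a rough sketch might look like the following. It is written
in Python for brevity rather than in GNUnet's C, and the function name
normalized_keywords is an assumption made up for this example, not an
existing GNUnet API.

```python
import unicodedata


def normalized_keywords(keywords):
    """Return the NFC and NFKC forms of each keyword, without duplicates.

    Hypothetical helper for illustration; order of first appearance is kept.
    """
    seen = set()
    result = []
    for keyword in keywords:
        for form in ("NFC", "NFKC"):
            candidate = unicodedata.normalize(form, keyword)
            if candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
    return result


# Two spellings of the same letter, plus a word using the U+FB01 ligature.
proposed = ["\u00fc", "u\u0308", "\ufb01le"]
final = normalized_keywords(proposed)
print(final)
```

Here the two spellings of u-with-diaeresis collapse into one keyword,
while the ligature word yields two: its NFC form keeps U+FB01 and its
NFKC form spells out "file".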

Hope this helps.
