Re: u32_normalize UNINORM_NFKC on 0xD800

From: Simon Josefsson
Subject: Re: u32_normalize UNINORM_NFKC on 0xD800
Date: Fri, 27 May 2011 11:23:03 +0200
User-agent: Gnus/5.110018 (No Gnus v0.18) Emacs/23.2 (gnu/linux)

Bruno Haible <address@hidden> writes:

> Simon Josefsson wrote:
>> I'm doing some Unicode NFKC operations and noticing that u32_normalize
>> fails for U+D800.
> This is valid behaviour, because U+D800 is a "surrogate" code point
> and therefore not a valid character code point.
> See the Unicode standard, chapter 2 [1], pages 23..24:
> Surrogate code points and other non-character code points "should never be
> interchanged". This means, for libunistring, that they are invalid input
> and invalid output in all functions taking or returning UTF-32 strings or
> UTF-8 strings.
> Character code points and code points that are in regions that may be assigned
> in future Unicode versions must not be rejected; these are valid input.

I'm not interchanging the code points; I'm calculating this IDNA2008 expression

   toNFKC(toCaseFold(toNFKC(cp))) != cp

for all code points.  Is this impossible to do with the u32_normalize

I notice that ICU also gives an error in this situation:


I wonder what the above expression means when toNFKC fails..

I managed to work around this using a local patch to make u32_uctomb
mimic u32_mbtouc_unsafe's behaviour.  But I'm not sure if I'm going to
use it.

--- lib/unistr/u32-uctomb.c.orig        2011-05-27 11:16:00.112466242 +0200
+++ lib/unistr/u32-uctomb.c     2011-05-27 11:16:01.696467065 +0200
@@ -30,8 +30,10 @@
 u32_uctomb (uint32_t *s, ucs4_t uc, int n)
   if (uc < 0xd800 || (uc >= 0xe000 && uc < 0x110000))
       if (n > 0)
           *s = uc;
@@ -39,9 +41,11 @@
         return -2;
     return -1;
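
For reference, here is a minimal standalone sketch of the range test that the hunk above is built around. It does not use libunistring at all; the names is_scalar_value and count_scalar_values are made up for illustration and are not part of any library. It only shows which code points pass the check, and how many do when looping over the whole code space.

```c
#include <stdint.h>

/* Illustrative helper (not a libunistring function): returns 1 if uc is a
   Unicode scalar value, i.e. in 0..0x10FFFF and outside the surrogate
   range 0xD800..0xDFFF.  Same condition as in the hunk above. */
static int
is_scalar_value (uint32_t uc)
{
  return uc < 0xd800 || (uc >= 0xe000 && uc < 0x110000);
}

/* Illustrative helper: walk every code point from 0 to 0x10FFFF and count
   those that pass the test.  A loop evaluating an expression "for all code
   points" could use the same test to decide which values to pass on to a
   normalizer and which to handle separately. */
static unsigned long
count_scalar_values (void)
{
  unsigned long n = 0;
  uint32_t cp;

  for (cp = 0; cp < 0x110000; cp++)
    if (is_scalar_value (cp))
      n++;
  return n;
}
```

With this test, 0xD7FF and 0xE000 are accepted while 0xD800 through 0xDFFF are rejected, so the loop counts 0x110000 - 0x800 = 1112064 values.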

