Re: Multibyte support (round 2)
From: Assaf Gordon
Subject: Re: Multibyte support (round 2)
Date: Mon, 29 Aug 2016 20:02:24 -0400
Hello Eric,
On 08/29/2016 01:13 PM, Eric Blake wrote:
> On 08/27/2016 12:05 AM, Assaf Gordon wrote:
>> Regarding wchar_t == UCS:
> But not in Cygwin, where wchar_t is 2 bytes, and where Cygwin already
> supports surrogate pairs in wchar_t to represent Unicode characters
> beyond 0xffff
Thank you for mentioning this.
On 32-bit AIX wchar_t is also 2 bytes, but I'm not sure whether it is UCS2 or
just BMP.
I can think of a few options:
1. Process entire lines, keep them in memory as multibyte strings in the
current locale,
then use gnulib's unicode-normalization functions that take an entire string
(e.g. u8_normalize).
(This was the initial implementation, in
http://lists.gnu.org/archive/html/coreutils/2016-07/msg00018.html ).
2. Detect such systems (where wchar_t==UCS2 or BMP) at runtime or at
configuration time,
and then either:
2.1: issue a warning if the input is beyond BMP (meaning partial unicode
normalization support on such systems)
2.2: add additional code to convert UCS-2 surrogate pairs into UCS4
3. Decide not to support unicode normalization on such systems (beyond what
'just works' with BMP characters).
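
To make options 2 and 2.2 a bit more concrete, here is a rough sketch of what
the detection and the surrogate-pair conversion could look like. All function
names below are hypothetical (they are not gnulib or coreutils APIs), the
input is modelled as uint16_t units rather than wchar_t so the code behaves
the same on every platform, and replacing unpaired surrogates with U+FFFD is
my own assumption, not a decided policy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: true when wchar_t is too narrow to hold code
   points beyond the BMP (as reported for Cygwin and 32-bit AIX).  */
static bool
wchar_is_16bit (void)
{
  return WCHAR_MAX <= 0xFFFF;
}

static bool
is_high_surrogate (uint32_t u)
{
  return u >= 0xD800 && u <= 0xDBFF;
}

static bool
is_low_surrogate (uint32_t u)
{
  return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a high/low surrogate pair into one code point >= U+10000.  */
static uint32_t
combine_surrogates (uint32_t hi, uint32_t lo)
{
  return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

/* Convert N 16-bit units from SRC into UCS-4 code points in DST.
   DST must have room for N elements.  Unpaired surrogates become
   U+FFFD (an assumption made for this sketch).  Returns the number
   of code points written.  */
static size_t
utf16_to_ucs4 (const uint16_t *src, size_t n, uint32_t *dst)
{
  size_t out = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint32_t u = src[i];
      if (is_high_surrogate (u) && i + 1 < n
          && is_low_surrogate (src[i + 1]))
        u = combine_surrogates (u, src[++i]);   /* consume both units */
      else if (is_high_surrogate (u) || is_low_surrogate (u))
        u = 0xFFFD;                             /* unpaired surrogate */
      dst[out++] = u;
    }
  return out;
}
```

With something like this, the normalization code would only call the
conversion when wchar_is_16bit() returns true, and could keep using wchar_t
values directly everywhere else.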
Comments welcomed,
- assaf