Re: Multibyte support (round 2)

From:	Assaf Gordon
Subject:	Re: Multibyte support (round 2)
Date:	Mon, 29 Aug 2016 20:02:24 -0400

Hello Eric,

On 08/29/2016 01:13 PM, Eric Blake wrote:
> On 08/27/2016 12:05 AM, Assaf Gordon wrote:
>> Regarding wchar_t == UCS:
> But not in Cygwin, where wchar_t is 2 bytes, and where Cygwin already
> supports surrogate pairs in wchar_t to represent Unicode characters
> beyond 0xffff

Thank you for mentioning this.
On 32-bit AIX, wchar_t is also 2 bytes, but I'm not sure whether it is UCS2 or just BMP.

I can think of a few options:

1. Process entire lines, keep them in memory as multibyte strings in the
current locale,
then use gnulib's Unicode normalization functions that take an entire string
(e.g. u8_normalize).
(This was the initial implementation, in
http://lists.gnu.org/archive/html/coreutils/2016-07/msg00018.html ).

2. Detect such systems (where wchar_t==UCS2 or BMP) at runtime or at
configuration time,
and then either:
2.1: issue a warning if the input is beyond the BMP (meaning partial Unicode
normalization support on such systems)
2.2: add additional code to convert UTF-16 surrogate pairs into UCS4

3. Decide not to support unicode normalization on such systems (beyond what 
'just works' with BMP characters).
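
To make option 2 more concrete, here is a rough sketch of what the detection
and the surrogate-pair conversion (2.2) might look like. This is not code from
the patch: the names WCHAR_IS_NARROW, is_high_surrogate, is_low_surrogate and
combine_surrogates are made up for illustration, and returning U+FFFD for an
invalid pair is only one possible policy.

```c
#include <stdbool.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical compile-time check: nonzero when wchar_t cannot hold
   every Unicode code point (e.g. a 16-bit wchar_t).  */
#if WCHAR_MAX < 0x10FFFF
# define WCHAR_IS_NARROW 1
#else
# define WCHAR_IS_NARROW 0
#endif

/* High (leading) surrogates occupy U+D800..U+DBFF.  */
static bool
is_high_surrogate (uint32_t u)
{
  return 0xD800 <= u && u <= 0xDBFF;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF.  */
static bool
is_low_surrogate (uint32_t u)
{
  return 0xDC00 <= u && u <= 0xDFFF;
}

/* Combine a UTF-16 surrogate pair into a single UCS-4 code point.
   Return U+FFFD (REPLACEMENT CHARACTER) if HI and LO do not form a
   valid pair; that error policy is an assumption of this sketch.  */
static uint32_t
combine_surrogates (uint32_t hi, uint32_t lo)
{
  if (!is_high_surrogate (hi) || !is_low_surrogate (lo))
    return 0xFFFD;
  return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

On a system where WCHAR_IS_NARROW is nonzero, the reading loop would need to
hold back a high surrogate until the next wchar_t arrives, and only then pass
the combined value on to the normalization code.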

Comments welcomed,
- assaf
