bug-bash
Re: Fix u32toutf8 so it encodes values > 0xFFFF correctly.


From: John Kearney
Subject: Re: Fix u32toutf8 so it encodes values > 0xFFFF correctly.
Date: Thu, 23 Feb 2012 06:06:18 +0100
User-agent: Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20120129 Thunderbird/10.0

And on the upside, if they ever do give in and allow registration of
family name characters, we may get a wchar_t, schar_t, lwchar_t and a
llwchar_t
:)
Just imagine a variable-length 64-bit char system.

Everything from Sumerian to Klingon in Unicode, though I think they
are already in there, if not officially, or are being worked on.

Oh god, what I really want now is bash in Klingon.

:))
Just imagine a black background with glaring green text.
I know what I'm doing tonight.

Check out (shakes head in disbelief, while chuckling):
Ubuntu Klingon Translators https://launchpad.net/~ubuntu-l10n-tlh
Expansion: Ubuntu Font should support pIqaD (Klingon)
https://bugs.launchpad.net/ubuntu/+source/ubuntu-font-family-sources/+bug/650729



On 02/23/2012 04:54 AM, Eric Blake wrote:
> On 02/22/2012 07:43 PM, John Kearney wrote:
>> ^ caveat: you can represent the full 0x10ffff in UTF-16, you just
>> need 2 UTF-16 code units (a surrogate pair). Check out the latest
>> version of unicode.c for an example of how.
> 
> Yes, and Cygwin actually does this.
> 
> A strict reading of POSIX states that wchar_t must be wide enough
> for all supported characters, technically limiting things to just
> the basic plane if you have 16-bit wchar_t and a POSIX-compliant
> app.  But Cygwin has exploited a loophole in the POSIX wording -
> POSIX does not require that all bit patterns are valid characters.
> So the actual Cygwin implementation is that on paper, rather than
> representing all 65536 patterns as valid characters, the values
> used in surrogate halves (0xd800 to 0xdfff) are listed as
> non-characters (so the use of them triggers undefined behavior per
> POSIX), but actually using them treats them as surrogate pairs
> (leading to the full Unicode character set, but reintroducing the
> headaches that multibyte characters had with 'char', but now with
> wchar_t, where you are back to dealing with variable-sized 
> character elements).
> 
> Furthermore, the mess of 16-bit vs. 32-bit wchar_t is one of the
> reasons why C11 has introduced two new character types, 16-bit and
> 32-bit characters, designed to fully map to the full Unicode set,
> regardless of what size wchar_t is.  It will be interesting to see
> how the next version of POSIX takes the additions of C11 and
> retrofits the other wide-character functions that are in POSIX but
> not in C99 to handle the new character types.
> 
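For anyone who wants the surrogate-pair arithmetic mentioned above spelled
out, here is a minimal sketch. It is not the code from bash's unicode.c; the
function names u32_to_utf16 and utf16_pair_to_u32 are made up for this
example, and it assumes plain uint16_t code units rather than any particular
wchar_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from bash's unicode.c).
 * Encode one Unicode code point as UTF-16 code units.
 * Returns the number of units written to out (1 or 2), or 0 if cp is
 * a surrogate value or lies above 0x10FFFF. */
size_t u32_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        /* Basic plane: a single code unit holds the value as-is. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Supplementary planes: subtract 0x10000, leaving a 20-bit value,
     * then split it into two 10-bit halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate  */
    return 2;
}

/* Hypothetical inverse: combine a high and a low surrogate back into a
 * code point. The caller must pass hi in 0xD800..0xDBFF and lo in
 * 0xDC00..0xDFFF; no validation is done here. */
uint32_t utf16_pair_to_u32(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                      | (uint32_t)(lo - 0xDC00));
}
```

With this sketch, U+1F600 comes out as the pair 0xD83D 0xDE00, and anything
in the basic plane outside the surrogate range stays a single unit. This is
also why 0xd800 to 0xdfff cannot be ordinary characters in a 16-bit scheme:
they are reserved to carry the two halves.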



