Re: Surrogate pairs for addwstr?

From: Bill Gray
Subject: Re: Surrogate pairs for addwstr?
Date: Sun, 10 Oct 2021 11:38:22 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0

Hi Thomas,  Tim,

On 10/9/21 7:04 PM, Tim Allen wrote:
> Surrogate pairs only combine to create a single character in UTF-16
> encoded data, or on platforms (Windows, Java, JavaScript, macOS Cocoa)
> that use UTF-16 as an internal representation. Code-points in the
> surrogate pair range are not allowed to appear in un-encoded Unicode
> data, so if they show up, at best they'll be ignored, but they might
> show up as blanks or as U+FFFD � REPLACEMENT CHARACTER.
>
> ncurses' wide mode might use the locale's encoding (UTF-8, almost
> universally) or might just hard-code UTF-8 as the internal
> representation, since it's generally the best choice for the kind of
> data ncurses handles. The behaviour you describe is within the range of
> behaviour I'd expect.

   Thank you.  I see your points;  in theory,  U+D834 and U+DD1E
should only happen with UTF-16 data.  And in theory,  theory and
practice are the same thing.  In practice,  they aren't.

   The other way to put this would be to ask : if you're on a
system with 32-bit wchar_ts,  what should happen for this line?

  mvaddwstr( 0, 2, L"\xd834\xdd1e Treble clef with a surrogate pair");

   At least when I run it in xterm,  the cursor advances twice for
the surrogate pairs.  I agree that doing so is certainly within what
you could expect for UTF-16.  But it is slightly problematic,  and I
don't see a drawback to recognizing the obvious intention and merging
the pair into U+1D11E.
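
   For reference,  the merging arithmetic is short.  What follows is
only a minimal sketch,  not the code from ncurses or PDCursesMod;  the
helper names (is_high_surrogate,  is_low_surrogate,  combine_surrogates)
are made up for illustration,  and returning 0 for an invalid pair is
just an assumption about how a caller might want failure reported.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of any curses API. */
static int is_high_surrogate(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

static int is_low_surrogate(uint32_t c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

/* Combine a high/low surrogate pair into one code point.
   Returns 0 if the two values do not form a valid pair. */
static uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    if (!is_high_surrogate(hi) || !is_low_surrogate(lo))
        return 0;
    /* Each surrogate carries 10 bits of (code point - 0x10000). */
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

   Calling combine_surrogates( 0xD834, 0xDD1E) gives 0x1D11E,  the
treble clef.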

   I will grant you that one can write,  say,

#ifdef USING_UTF_16
    #define TREBLE_CLEF L"\xd834\xdd1e"
#else
    #define TREBLE_CLEF L"\x1d11e"
#endif


  mvaddwstr( 0, 2, TREBLE_CLEF " Treble clef for your platform");

   and work around the problem.  (With,  I _think_,  USING_UTF_16
basically meaning "is sizeof( wchar_t) == 2",  but you can't test
sizeof in an #if.)  But for a cross-platform solution,  it would
certainly be easier just to provide the surrogate pair.
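
   One possible way to get that test at preprocessing time is to
compare WCHAR_MAX (from <stdint.h> or <wchar.h>) instead of using
sizeof.  This is a sketch under the assumption that a wchar_t whose
maximum is 0xFFFF or less means the platform is using UTF-16;
treble_clef_units is a hypothetical helper added only so the result
can be checked.

```c
#include <stddef.h>
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>    /* wcslen */

/* Assumption: a wchar_t that cannot hold values above 0xFFFF
   means wide strings are UTF-16 on this platform. */
#if WCHAR_MAX <= 0xFFFF
    #define TREBLE_CLEF L"\xd834\xdd1e"   /* surrogate pair */
#else
    #define TREBLE_CLEF L"\x1d11e"        /* single code point */
#endif

/* Hypothetical helper: number of wchar_t units in the literal. */
static size_t treble_clef_units(void)
{
    return wcslen(TREBLE_CLEF);
}
```

   With 16-bit wchar_ts the literal is two units long;  with 32-bit
wchar_ts it is one.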

   At present,  I've got surrogate pairs combining regardless of
encoding in PDCursesMod;  is there really a situation where I ought
to instead be displaying glyphs of some sort for U+D800 to U+DFFF?

-- Bill
