Re: Surrogate pairs for addwstr?

From: Tim Allen
Subject: Re: Surrogate pairs for addwstr?
Date: Mon, 11 Oct 2021 15:05:49 +1100

On Sun, Oct 10, 2021 at 11:38:22AM -0400, Bill Gray wrote:
>    The other way to put this would be to ask : if you're on a
> system with 32-bit wchar_ts,  what should happen for this line?
>   mvaddwstr( 0, 2, L"\xd83d\xdd1e Treble clef with a surrogate pair");

Honestly, what I'd *expect* to happen is a compile-time or run-time
error. This is what, for example, Rust does:

    error: invalid unicode character escape
     --> src/main.rs:2:34
    2 |     println!("Treble clef: {}", "\u{d83d}\u{dd1e}");
      |                                  ^^^^^^^^ invalid escape
      = help: unicode escape must not be a surrogate

...and also what Python 3 does:

    >>> print("\ud83d\udd1e")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode characters in position
    0-1: surrogates not allowed

I guess C/C++ compilers don't report this as a problem because
technically the wide-character encoding is a property of libc, not of
the compiler, and they don't want to assume that libc is Unicode-based.

> #ifdef USING_UTF_16

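Apparently the incantation, once <wchar.h> or <stdint.h> has been
included (an undefined WCHAR_MAX silently evaluates to 0 in #if), is: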

    #if WCHAR_MAX == 65535

>    At present,  I've got surrogate pairs combining regardless of
> encoding in PDCursesMod;  is there really a situation where I ought
> to instead be displaying glyphs of some sort for U+D800 to U+DFFF?

Printing gibberish is never particularly helpful, but encouraging people
to assume wide-string literals (or wide strings in general) use UTF-16
encoding seems like a bad idea. Sure, you can make it work transparently
for curses, but there are other libraries (like libc) that are likely to
get tripped up, and that seems like a foot-gun waiting to happen. Even
if you provide a utf16towcs() helper, people are going to forget to call
it since the input and output types are both wchar_t*.
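
To make the problem concrete, here is a rough sketch of what such a
helper might look like. The signature is my own guess, nothing here is
an existing PDCursesMod or libc API, and it assumes the caller supplies
a destination buffer at least as long as the source. Note that nothing
in the types stops a caller from skipping it or calling it twice:

```c
#include <stddef.h>
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>

/* Hypothetical helper, signature assumed for illustration only.
 * Copies src to dst. When wchar_t is wider than 16 bits, each valid
 * high/low surrogate pair is merged into a single code point; unpaired
 * surrogates and everything else are copied through unchanged.
 * dst must have room for wcslen(src) + 1 elements.
 * Returns the number of wchar_t written, not counting the terminator. */
static size_t utf16towcs(wchar_t *dst, const wchar_t *src)
{
    size_t n = 0;

    while (*src != L'\0') {
        unsigned long hi = (unsigned long)src[0];
#if WCHAR_MAX > 0xFFFF
        if (hi >= 0xD800 && hi <= 0xDBFF) {
            /* src[1] is at worst the terminator, so this read is safe */
            unsigned long lo = (unsigned long)src[1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                dst[n++] = (wchar_t)(0x10000 + ((hi - 0xD800) << 10)
                                             + (lo - 0xDC00));
                src += 2;
                continue;
            }
        }
#endif
        dst[n++] = (wchar_t)hi;
        src++;
    }
    dst[n] = L'\0';
    return n;
}
```

With 32-bit wchar_t, the pair D83D DD1E from the example above comes
out as the single code point U+1F51E (the G clef, U+1D11E, would be
D834 DD1E). With 16-bit wchar_t the string is copied unchanged.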

The absolute simplest and safest thing a portable program could do is to
restrict itself to the Basic Multilingual Plane. The second simplest and
safest thing would probably be to store strings as UTF-8 (narrow) string
literals, and provide some kind of utf8stowcs() that decodes to UTF-16
or to UTF-32 depending on the value of WCHAR_MAX.
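
A minimal sketch of that second option, assuming a hypothetical
utf8stowcs() with a signature I made up for illustration (destination
buffer, its capacity in wchar_t, and a NUL-terminated UTF-8 source).
It treats WCHAR_MAX <= 0xFFFF as "wchar_t holds UTF-16" and anything
larger as "wchar_t holds UTF-32", which is an assumption rather than
something the C standard promises:

```c
#include <stddef.h>
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>

/* Hypothetical helper, signature assumed for illustration only.
 * Decodes NUL-terminated UTF-8 from src into dst, writing at most
 * dstlen wchar_t including the terminator. Returns the number of
 * wchar_t written (excluding the terminator), or (size_t)-1 if the
 * input is malformed or dst is too small. */
static size_t utf8stowcs(wchar_t *dst, size_t dstlen, const char *src)
{
    /* Smallest code point allowed for each sequence length (1..4 bytes) */
    static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s != 0) {
        unsigned long cp;
        int extra;   /* number of continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;   /* stray continuation or bad lead byte */
        s++;

        for (int i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)
                return (size_t)-1;   /* also catches a premature NUL */
            cp = (cp << 6) | (*s & 0x3F);
        }

        /* Reject overlong forms, encoded surrogates, and values > U+10FFFF */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

#if WCHAR_MAX <= 0xFFFF
        if (cp >= 0x10000) {
            /* 16-bit wchar_t: emit a surrogate pair */
            if (n + 2 >= dstlen)
                return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (wchar_t)(0xD800 + (cp >> 10));
            dst[n++] = (wchar_t)(0xDC00 + (cp & 0x3FF));
            continue;
        }
#endif
        if (n + 1 >= dstlen)
            return (size_t)-1;
        dst[n++] = (wchar_t)cp;
    }

    if (dstlen == 0)
        return (size_t)-1;
    dst[n] = L'\0';
    return n;
}
```

The source text then stays the same on every platform, for example
utf8stowcs(buf, 64, "\xF0\x9D\x84\x9E clef") for U+1D11E, and the
surrogate pair only ever exists where wchar_t is actually 16 bits wide.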

