From: Eli Zaretskii
Subject: Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
Date: Thu, 19 Jan 2017 18:26:30 +0200
> From: Assaf Gordon <address@hidden>
> Date: Thu, 19 Jan 2017 00:46:47 -0500
>
> > 5. How do we handle MinGW and Cygwin, where wchar_t is 16 bits, vs. 32
> > bits just about everywhere else?
>
> The parsing is the same (i.e. "\uHHHH" to internal 'unsigned int' or 'ucs4_t'
> from gnulib).
>
> The conversion to multibyte will use gnulib's (or the system's native)
> widechar-to-multibyte functions.
>
> In case of cygwin/mingw, an extra step of converting the 'uint32' to two
> 'uint16' is needed, and then two calls for wctomb are needed.
I don't see how this could work: AFAIK the MS-Windows wctomb accepts a
single wchar_t value, so it can only support Unicode codepoints inside
the BMP. You cannot call it with 2 wchar_t values one after the other
to get support for the full Unicode range. (This is relevant to
MinGW; I think Cygwin doesn't have this problem.)
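To make the 16-bit problem concrete, here is a minimal sketch (not code from this thread; the name split_surrogates is hypothetical) of how a codepoint is split into UTF-16 code units. For anything above the BMP, both resulting units are surrogates, and neither one is a character on its own, which is why passing them to wctomb one at a time cannot be expected to yield the right multibyte sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split a Unicode codepoint into UTF-16 code units.
   Returns the number of units stored in OUT (1 or 2), or 0 if CP is
   not a valid Unicode scalar value.  */
static int
split_surrogates (uint32_t cp, uint16_t out[2])
{
  assert (out != NULL);

  /* Reject values beyond U+10FFFF and lone surrogates.  */
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;

  /* Inside the BMP a single 16-bit unit is enough.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }

  /* Outside the BMP: 20 bits are spread over a high and a low surrogate.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));
  return 2;
}
```

For U+1F600 this gives the pair 0xD83D, 0xDE00; each half falls in the surrogate range, so it does not denote a character by itself.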
Really, to have decent support for Unicode on MS-Windows, you will
need to abandon the Windows runtime support for wchar_t, and instead
use your own 32-bit data type and conversion functions.
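As a rough illustration of what "your own 32-bit data type and conversion functions" could look like, here is a sketch that encodes a 32-bit codepoint as UTF-8 without going through wchar_t at all. The names my_ucs4_t and ucs4_to_utf8 are hypothetical, and UTF-8 as the target encoding is an assumption made for the example; a real implementation would have to convert to whatever codeset the locale actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit codepoint type, independent of wchar_t.  */
typedef uint32_t my_ucs4_t;

/* Hypothetical helper: encode CP as UTF-8 into BUF.
   Returns the number of bytes written (1 to 4), or 0 if CP is not a
   valid Unicode scalar value.  */
static size_t
ucs4_to_utf8 (my_ucs4_t cp, unsigned char buf[4])
{
  assert (buf != NULL);

  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      /* Surrogates are not characters and must not be encoded.  */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

Because the codepoint is kept in a 32-bit value from parsing to output, the width of the platform's wchar_t never enters the picture.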
One more quirk of MS-Windows is that no locale can use UTF-8 as its
codeset, so the assumption of "UTF-8 locale everywhere" is not useful
on Windows.