Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode

bug-gawk

From:	Eli Zaretskii
Subject:	Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
Date:	Thu, 19 Jan 2017 18:26:30 +0200

> From: Assaf Gordon <address@hidden>
> Date: Thu, 19 Jan 2017 00:46:47 -0500
> 
> > 5. How do we handle MinGW and Cygwin where wchar_t is 16 bits, vs. 32
> > bits just about everywhere else?
> 
> The parsing is the same (i.e. "\uHHHH" to internal 'unsigned int' or 'ucs4_t' 
> from gnulib).
> 
> The conversion to multibyte will use gnulib's (or the system's native)
> widechar-to-multibyte functions.
> 
> In case of cygwin/mingw, an extra step of converting the 'uint32' to two 
> 'uint16' is needed,
> and then two calls for wctomb are needed.

I don't see how this could work: AFAIK the MS-Windows wctomb accepts a
single wchar_t value, so it can only support Unicode codepoints inside
the BMP.  You cannot call it with 2 wchar_t values one after the other
to get support for the full Unicode range.  (This is relevant to
MinGW; I think Cygwin doesn't have this problem.)
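
To make the "two 'uint16'" step concrete, here is a minimal sketch in C of splitting a 32-bit code point into UTF-16 units. The function name ucs4_to_utf16 is hypothetical and is not taken from gnulib or from any patch in this thread. It only shows the arithmetic; it does not show a working call into wctomb, since each half of a surrogate pair is not a character on its own and a one-wchar_t-at-a-time converter has nothing meaningful to produce for it.

```c
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to OUT (1 or 2),
   or 0 if CP is a surrogate or lies beyond U+10FFFF.  */
static int
ucs4_to_utf16 (uint32_t cp, uint16_t out[2])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;                   /* not a valid scalar value */

  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;   /* inside the BMP: one unit suffices */
      return 1;
    }

  cp -= 0x10000;                /* 20 bits remain */
  out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

For example, U+1F600 yields the pair 0xD83D 0xDE00, and neither value alone identifies the character.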

Really, to have decent support for Unicode on MS-Windows, you will
need to abandon the Windows runtime support for wchar_t, and instead
use your own 32-bit data type and conversion functions.
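
As a rough illustration of what "your own 32-bit data type and conversion functions" could look like, the following C sketch encodes a code point held in a uint32_t directly as UTF-8, without going through wchar_t or wctomb. The name ucs4_to_utf8 is hypothetical and is not an existing gnulib or gawk function. It assumes the desired output is UTF-8; a program running in a non-UTF-8 codeset would still need a further conversion step that is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Returns the number of bytes written to OUT (1 to 4), or 0 if CP
   is a surrogate or lies beyond U+10FFFF.  */
static size_t
ucs4_to_utf8 (uint32_t cp, unsigned char out[4])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not characters */
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;                     /* beyond the Unicode range */
}
```

Because the code point never passes through a 16-bit type, values outside the BMP such as U+1F600 are handled the same way on every platform.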

One more quirk of MS-Windows is that no locale can use UTF-8 as its
codeset, so the assumption of "UTF-8 locale everywhere" is not useful
on Windows.
