bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#7948: 16-bit wchar_t on Windows and Cygwin


From: Bruno Haible
Subject: bug#7948: 16-bit wchar_t on Windows and Cygwin
Date: Wed, 2 Feb 2011 12:29:03 +0100
User-agent: KMail/1.9.9

Hello Eric,

> ... POSIX requires that 1 wchar_t corresponds to 1 character
> ...
> > What consequences does this have?
> > 
> >   1) All code that uses the functions from <wctype.h> (wide character
> >      classification and mapping) or wcwidth() malfunctions on strings that
> >      contains Unicode characters outside the BMP, i.e. outside the range
> >      U+0000..U+FFFF.
> 
> Not necessarily.  Such code falls outside of POSIX, but it may still be
> a well-behaved extension if given sane behavior for how to deal with
> surrogates.

No. Code that uses <wctype.h> and wcwidth() is written precisely according
to POSIX. The problem is that this code cannot work correctly when wchar_t[]
is in UTF-16 encoding. There simply is no way to define these functions
in a reasonable way for surrogates.

For example:
  U+1031E = 0xD800 0xDF1E   is a letter (iswalpha should be true)
  U+10320 = 0xD800 0xDF20   is not a letter (iswalpha should be false)
  U+1D31E = 0xD834 0xDF1E   is not a letter (iswalpha should be false)
  U+1D320 = 0xD834 0xDF20   is not a letter (iswalpha should be false)
  U+1D71E = 0xD835 0xDF1E   is a letter (iswalpha should be true)
  U+1D720 = 0xD835 0xDF20   is a letter (iswalpha should be true)
There is no way that a system can provide this information through a
function 'iswalpha' that takes a single wchar_t argument.

It would be possible to provide this information
  - either through a function iswalpha2 (wchar_t wc1, wchar_t wc2)
    that takes two wchar_t arguments,
  - or through a function uc_is_alpha (ucs4_t uc),
but that is not POSIX, and it would require rewriting each and every
piece of code that currently uses <wctype.h> in the POSIX way.

> we can (try) to make the various wc* functions try to
> behave as smartly as possible (as is the case with Cygwin); where those
> smarts are only needed when you use surrogate pairs.

The point is that this approach can work fine for mbrtowc() and wcrtomb(),
but it cannot yield a working definition for the <wctype.h> functions and
wcwidth().

> >   2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
> >      On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an 
> > intelligent
> >      but somewhat surprising way: wcrtomb() may return 0, that is, produce 
> > no
> >      output bytes when it consumes a wchar_t.
> 
> >   Now with a chinese character outside the BMP:
> >   $         
> >         1       4
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         3       6
> > 
> >   On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
> > 
> >   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
> >         1       5
> >   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
> >         2       7
> >
> >   So both the number of characters and the number of words are counted
> >   wrong as soon as non-BMP characters occur.
> >
> 
> Does this represent a bug in cygwin's mbrtowc routines that could be
> fixed by cygwin?
> 
> Or, does this represent a bug in coreutils for using mbrtowc one
> character at a time instead of something like mbsrtowcs to do bulk
> conversions?

We agree that it is a bug. And it is caused by
  - the fact that Cygwin's wchar_t[] encoding is UTF-16, and
  - there is no way to define the <wctype.h> POSIX functions sanely in this
    setting, and
  - coreutils and gnulib make use of the POSIX functions.

Even if coreutils were to use mbsrtowcs instead of repeated use of
mbrtowc, there would be no way for it to produce the correct result
without combining surrogates into entire characters.

> And if we decide that cygwin's extensions are sane, how much harder is
> it to characterize what a program must do to be portable to both 16-bit
> and 32-bit wchar_t if they are guaranteed the same behavior for all
> hosts of the same-size wchar_t?  In other words, would it really require
> that many #ifdefs in coreutils to portably and simultaneously support
> both sizes of wchar_t?

It would require
  1. to change the conversions that use mbrtowc to either convert an
     entire string at once (use mbsrtowcs), or make a second call to
     mbrtowc once the first call to mbrtowc has determined a low
     surrogate.
  2. to change all uses of <wctype.h> and wcwidth() to use different
     functions, either functions that take 2 wchar_t arguments, or
     functions that require the caller to combine the surrogates.

This means, lots of logic that goes against the spirit of wchar_t
in ANSI C Amd. 1 and POSIX.

> > I'm more in favour of overriding wchar_t and all functions that depend on 
> > it -
> > like we did successfully for the socket functions.
> > 
> > In practice, this would mean that on Windows (both native Windows and
> > Cygwin >= 1.7) the use of a 'wchar_t' module will
> >   - override wchar_t to be 32 bits, like in glibc,
> >   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
> >     corresponding system functions are unusable, the replacements will use 
> > the
> >     modules from libunistring (such as unictype/ctype-alnum and 
> > uniwidth/width).
> ...
> compiler primitives, like L"xyz", which result in 16-bit wchar_t
> arrays, will be unusable

Good point. I agree then that overriding wchar_t should better not be
done.

> C1x will be adding compiler support for mandatory char16_t and char32_t
> types for UTF-16 and UTF-32 data, independently of whether wchar_t is
> 16-bit or 32-bit; maybe the better thing is to proactively start
> providing the new interfaces in <uchar.h> that will result from C1x
> adoption (and convert GNU programs to use this rather than wchar_t for
> character operations)
> 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

A newer draft is at
https://www.opengroup.org/platform/single_unix_specification/uploads/40/23495/n1548.pdf

This is a good point, but would have two drawbacks:

  - It throws out the use of a POSIX API for a not-yet-standard API,

  - Performance: For the non-UTF-8 locales (ISO-8859-15, EUC-JP, and
    similar) on platforms like MacOS X, FreeBSD, Solaris, the 'wchar_t'
    representation is essentially a packed multibyte representation.
    Which makes mbrtowc() fast, because it does not have to do a table
    lookup for the conversion from/to Unicode. If you use mbrtoc32
    instead of mbrtowc, you add extra runtime overhead for a conversion
    to Unicode, that would not be necessary when using mbrtowc().

In other words, your proposal would solve the Windows wchar_t problem,
but at the price of a performance penalty on traditional Unix systems.

Here's a new proposal:
  - Define a type 'wwchar_t' on all platforms, equivalent to uint32_t
    on Windows platforms and to 'wchar_t' otherwise.
  - Define functions 'mbrtowwc', 'iswwalpha', 'wwcwidth', and similar.
    Their definition will be a trivial redirection to 'mbrtowc', 'iswalpha',
    'wcwidth' on most platforms, and a use of libunistring modules on
    Windows platforms.

With this proposal,

  - The code that uses <wctype.h> has to be changed, but in a trivial
    way that introduces no complicated logic: Just change 'w' to 'ww'.
    Not more difficult than, say, using strtoll() instead of strtol().

  - The runtime penalty on non-Windows systems is minimal.

  - On Windows platforms, surrogates are handled correctly, and
    code that uses wchar_t or <windows.h> is left alone.

How does that sound? Comments?

Bruno
-- 
In memoriam Carl Friedrich Goerdeler 
<http://en.wikipedia.org/wiki/Carl_Friedrich_Goerdeler>





reply via email to

[Prev in Thread] Current Thread [Next in Thread]