Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8

bug-gnulib

From:	Eric Blake
Subject:	Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
Date:	Thu, 07 Jun 2012 09:42:55 -0600
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

On 06/07/2012 09:21 AM, Paolo Bonzini wrote:

>>> ... which is to wrap MB_CUR_MAX and pretend that it is 3.
>>
>> Actually, MB_CUR_MAX of UTF-8 is 6, thanks to surrogate pairs.
> 
> No, it is 6 mostly thanks to the original 32-bit definition of
> ISO-10646.  UTF-8 codes that decode to 0xD800 -> 0xDFFF are invalid.
> Some programs produce this encoding, but iconv will not support it on
> glibc and technically it's not UTF-8.
> 
> However I did count wrong, MB_CUR_MAX for UTF-8 must be at least 4 to
> encode the 21 bits of Unicode (3 in a first byte of the form 11110bbb, 6
> each in the next: 3+6*3 = 21).

You are correct that on glibc, where sizeof(wchar_t)==4, an MB_CUR_MAX
of 4 is valid.  But on Cygwin, where sizeof(wchar_t)==2, MB_CUR_MAX is
intentionally 6, because Cygwin deliberately supports surrogate pairs
as the only way to represent high-plane Unicode characters.  Such
support is NOT compliant with POSIX, but it is a useful enough
extension that it was deemed better than any other alternative.  And
yes, that means that on Cygwin, any character > 0xffff becomes a
multi-wchar_t encoding, which makes all of the wide character
functions even harder to reason about.
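
To make the arithmetic concrete, here is a minimal sketch in C.  It is
not taken from glibc, Cygwin, or gnulib; the helper names utf8_len and
utf16_units are hypothetical and exist only for illustration.  It
assumes UTF-8 restricted to U+10FFFF, so the longest sequence is 4
bytes, and it shows how a code point above 0xffff splits into two
16-bit units when wchar_t is only 16 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode code point cp
   in UTF-8 restricted to U+10FFFF.  Returns 0 for surrogate code
   points (0xD800..0xDFFF) and for values above 0x10FFFF, neither of
   which is valid in UTF-8. */
static size_t utf8_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp <= 0x7F)
        return 1;               /* 0bbbbbbb: 7 bits */
    if (cp <= 0x7FF)
        return 2;               /* 110bbbbb + 1 continuation: 5+6 bits */
    if (cp <= 0xFFFF)
        return 3;               /* 1110bbbb + 2 continuations: 4+6*2 bits */
    if (cp <= 0x10FFFF)
        return 4;               /* 11110bbb + 3 continuations: 3+6*3 bits */
    return 0;
}

/* Hypothetical helper: store cp as UTF-16 code units in out[] and
   return how many units were written (1 or 2), or 0 if cp is not a
   valid code point.  A result of 2 is the multi-wchar_t case that
   arises when sizeof(wchar_t)==2. */
static size_t utf16_units(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

With these helpers, utf8_len(0x10FFFF) gives 4, matching the 3+6*3 = 21
bit count quoted above, while utf16_units(0x1F600, buf) reports 2 units,
which is the case where a single character no longer fits in one
16-bit wchar_t.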

-- 
Eric Blake   address@hidden    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

