Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
From: Paolo Bonzini
Subject: Re: GNU sed version 4.2.1: on OS X, C locale gets aliased to UTF-8
Date: Thu, 07 Jun 2012 17:21:52 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1
On 07/06/2012 16:51, Eric Blake wrote:
> On 06/07/2012 08:13 AM, Paolo Bonzini wrote:
>> On 07/06/2012 14:50, Eric Blake wrote:
>>>>> The fix could be to have two different locale_charset() functions,
>>>>> one that returns "US-ASCII" and another one that returns "UTF-8".
>>>>> The first one to be used when MB_CUR_MAX and mbrtowc() are used as
>>>>> well, the second one to be used by gettext(). But the separation
>>>>> line between the two cases is not yet clear to me. Any insights?
>>
>> The separation line is what you wrote: whether you'll use the text
>> simply for presentation, or whether you'll process it before. But
>> alternatively, we might try a variant of what Eric has suggested...
>>
>>> On OS X, can we wrap MB_CUR_MAX to pretend to be 1 when in the "C"
>>> locale, to match what cygwin did in distinguishing between 'C' and
>>> 'C.UTF-8'?
>>
>> ... which is to wrap MB_CUR_MAX and pretend that it is 3.
>
> Actually, MB_CUR_MAX of UTF-8 is 6, thanks to surrogate pairs.
No, it is 6 mostly because of the original ISO 10646 definition, whose
31-bit code space (held in 32-bit UCS-4 units) needs up to six bytes in
UTF-8. UTF-8 sequences that decode to the surrogate range 0xD800..0xDFFF
are invalid. Some programs produce this encoding, but glibc's iconv does
not accept it, and technically it is not UTF-8.
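To illustrate the point about surrogates, here is a minimal sketch in Python (not glibc's iconv, which is what the message refers to). It assumes Python 3's built-in "utf-8" codec as a stand-in for a strict decoder; the names SURROGATE_BYTES, is_strict_utf8 and decodes_to_surrogate are made up for this example.

```python
# Three-byte pattern that would decode to the lone surrogate U+D800
# (illustrative input, chosen for this sketch).
SURROGATE_BYTES = b"\xed\xa0\x80"


def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decodes_to_surrogate(data: bytes) -> bool:
    """Decode leniently and report whether any code point is a surrogate.

    The 'surrogatepass' error handler lets the codec accept sequences
    that a strict decoder rejects, so the decoded values can be inspected.
    """
    text = data.decode("utf-8", errors="surrogatepass")
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)
```

With this sketch, is_strict_utf8(SURROGATE_BYTES) is False while decodes_to_surrogate(SURROGATE_BYTES) is True: the bytes follow the three-byte layout, but a strict decoder refuses them because of the value they carry.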
However, I did count wrong: MB_CUR_MAX for UTF-8 must be at least 4 to
encode the 21 bits of Unicode (3 in a first byte of the form 11110bbb, 6
in each of the next three: 3+6*3 = 21).
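The bit counting above can be checked with a short Python sketch. It assumes the standard UTF-8 layout (a lead byte with n one-bits and a zero, followed by continuation bytes of the form 10bbbbbb); the helper names payload_bits and min_utf8_length are hypothetical and exist only for this illustration.

```python
def payload_bits(nbytes: int) -> int:
    """Number of code point bits an nbytes-long UTF-8-style sequence holds."""
    if nbytes == 1:
        return 7  # 0bbbbbbb
    if not 2 <= nbytes <= 6:
        raise ValueError("sequence length must be between 1 and 6")
    lead_bits = 7 - nbytes  # e.g. 11110bbb leaves 3 bits when nbytes == 4
    return lead_bits + 6 * (nbytes - 1)  # each continuation byte adds 6 bits


def min_utf8_length(code_point: int) -> int:
    """Shortest sequence length whose payload can hold code_point."""
    for nbytes in range(1, 7):
        if code_point.bit_length() <= payload_bits(nbytes):
            return nbytes
    raise ValueError("code point does not fit in 31 bits")
```

Here payload_bits(4) gives 3 + 6*3 = 21, which is enough for U+10FFFF, and payload_bits(6) gives 1 + 6*5 = 31, which is why the older six-byte form was needed for the full 31-bit code space.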
Paolo