Re: more on failing test 'invalid-mb-seq-UMR.sh'
From: Assaf Gordon
Subject: Re: more on failing test 'invalid-mb-seq-UMR.sh'
Date: Fri, 17 Jun 2016 00:39:08 -0400
Corrected mistake below:
> On Jun 17, 2016, at 00:06, Assaf Gordon <address@hidden> wrote:
>
> [...]
> On Mac OS X, results are strange:
> 1. The conversion succeeds in 'eucJP', and also produces 2 characters.
> This is a source of the failed test in sed (invalid-mb-seq-UMR.sh),
> as this consumes 1 byte from the input string, and produces two bytes.
>
The above is incorrect. Should've said:
On Mac OS X,
mbrtowc(3) with input = '\262c' incorrectly returned '2',
meaning it *consumed* two bytes and returned wide-char=0xb2e3.
Later on, the wide-char is converted back to a multibyte character, resulting
in a 2-byte string.
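
For anyone who wants to check their own system, here is a minimal sketch of
such a probe. It is not code from sed or from the test suite;
probe_mbrtowc() is a hypothetical helper name, and the locale name you pass
in is an assumption, since its spelling differs between systems.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of sed): switch LC_CTYPE to locale_name and
   decode the first character of buf[0..len) with mbrtowc(3).
   Returns 0 if the locale could be set, -1 otherwise.
   On success, *result holds mbrtowc's return value and *wc the decoded
   wide character (left as L'\0' if decoding failed). */
int probe_mbrtowc(const char *locale_name, const char *buf, size_t len,
                  size_t *result, wchar_t *wc)
{
    mbstate_t st;

    assert(buf != NULL && result != NULL && wc != NULL);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;                  /* locale not installed on this system */

    memset(&st, 0, sizeof st);      /* start from the initial shift state */
    *wc = L'\0';
    *result = mbrtowc(wc, buf, len, &st);
    return 0;
}
```

Calling it with an EUC-JP locale (for example "ja_JP.eucJP", assuming that
name exists on your system) and the two bytes "\262c" should show the
difference: a result of 2 matches the Mac OS X behaviour described above,
while (size_t)-1 means the sequence was rejected as invalid.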
To expand:
The test 'invalid-mb-seq-UMR.sh' uses '\262C' as input (with additional
upper-case conversion \U ).
On most GNU/Linux systems, the flow is:
1. read '\262c'
2. it is detected as invalid multibyte
3. one byte '\262' is consumed, and written as-is.
4. the next byte 'c' is consumed, and written (as upper case).
5. The final output is '\262C' (0xB2 0x43).
6. Test passes.
On Mac OS X, the flow is:
1. read '\262c'
2. it is detected as a valid 2-byte multibyte sequence, with a wide-char value of 0xB2E3
3. 2 bytes are consumed ('\262' and 'c').
4. the wide-char is converted back to multibyte 0xB2 0xE3 and written to output.
5. the test fails.
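
The two flows above can be sketched as one loop whose outcome depends only
on what mbrtowc(3) reports. This is a simplified illustration under my own
assumptions, not sed's actual code; upcase_bytes() is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical sketch (not sed's code): upper-case in[0..n) under the
   current LC_CTYPE locale and write the result to out.
   out must have room for n * MB_LEN_MAX bytes.
   Returns the number of bytes written. */
size_t upcase_bytes(const char *in, size_t n, char *out)
{
    mbstate_t dec, enc;
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    memset(&dec, 0, sizeof dec);
    memset(&enc, 0, sizeof enc);

    while (i < n) {
        wchar_t wc;
        char tmp[MB_LEN_MAX];
        size_t r = mbrtowc(&wc, in + i, n - i, &dec);

        if (r == (size_t)-1 || r == (size_t)-2) {
            /* Invalid or incomplete sequence: consume one byte and write
               it as-is, then reset the decoder state. */
            out[o++] = in[i++];
            memset(&dec, 0, sizeof dec);
            continue;
        }
        if (r == 0)
            r = 1;                  /* a NUL character consumes one byte */

        /* Valid character: upper-case it and convert back to multibyte. */
        size_t w = wcrtomb(tmp, (wchar_t)towupper((wint_t)wc), &enc);
        if (w == (size_t)-1) {
            /* Cannot re-encode: fall back to copying the original bytes. */
            memcpy(out + o, in + i, r);
            o += r;
            memset(&enc, 0, sizeof enc);
        } else {
            memcpy(out + o, tmp, w);
            o += w;
        }
        i += r;
    }
    return o;
}
```

If mbrtowc() rejects '\262', the loop takes the first branch and the 'c' is
upper-cased on the next iteration, as in the GNU/Linux flow. If mbrtowc()
instead returns 2, both bytes are consumed at once and the 'c' never gets
upper-cased on its own, as in the Mac OS X flow.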
to be continued,
- assaf