Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
From: Pádraig Brady
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 09:25:18 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 21/07/18 15:43, Bruno Haible wrote:
> Hi Pádraig,
>
>> I've attached a gnulib patch to document for iscntrl at least.
>
>> +This function does not support arguments outside of the range of the
>> +unsigned char type in locales with large character sets, on some platforms.
>> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.
>
> In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().
>
> POSIX [1] says
> "The c argument is a type int, the value of which the application shall
> ensure is a character representable as an unsigned char or equal to the
> value of the macro EOF. If the argument has any other value, the behavior
> is undefined."
>
> The term "character" is defined here [2]:
> "A sequence of one or more bytes representing a single graphic symbol or
> control code."
>
> So, in a UTF-8 locale, a "character representable as an unsigned char"
> is a byte sequence of length 1, where the single byte has a value in the
> range 0x00..0x7F.
>
> For invalid values "the behavior is undefined." You were expecting a value 0.
>
> Now, in the gnulib documentations, what we mention as portability problems
> are the cases where
> - the behaviour for valid arguments is different on different platforms, or
> - the boundary between valid and invalid arguments is fuzzy and depends on
> the platform.
> IMO there's no point in documenting that a function _really_ has undefined
> behaviour when POSIX says that it has undefined behaviour.
Thanks for all that info. I agree the iscntrl() behavior on macOS is within spec,
though it is still surprising, and it differs from other systems.
I agree docs should be as succinct as possible, though...
>> I've also attached an alternative patch for df (in your name).
>
> This patch is correct (because the characters that you test for in c_iscntrl
> are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
> character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
... It might be worth mentioning this subtle point in the c_iscntrl() docs?
"Note that this identifies all single-byte control chars, even in multibyte
encodings."
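
To illustrate the point, here is a minimal sketch of a byte-wise, locale-independent
check. The names ascii_iscntrl and replace_ascii_cntrl are hypothetical; this is
not gnulib's c_iscntrl() nor df's hide_problematic_chars, only an approximation
of the idea under the assumption that control bytes are replaced with '?'.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not gnulib's c_iscntrl): true only for the
   ASCII control bytes 0x00..0x1F and 0x7F, whatever the locale.  */
bool
ascii_iscntrl (unsigned char c)
{
  return c < 0x20 || c == 0x7F;
}

/* Hypothetical helper: replace each ASCII control byte in
   BUF[0..LEN-1] with '?'.  Bytes >= 0x80 are never touched, so the
   non-ASCII bytes of a multibyte character are left intact.  */
void
replace_ascii_cntrl (char *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (ascii_iscntrl ((unsigned char) buf[i]))
      buf[i] = '?';
}
```

Because the test never consults the locale, it sidesteps the question of what
iscntrl() returns for bytes >= 0x80.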
> But it does not catch control characters outside of the ASCII range. It would
> make sense to catch these as well. If you want to do that,
> 'hide_problematic_chars' needs to be rewritten as a loop that iterates across
> the multibyte characters. For example with the 'mbiter' module, in
> combination with the mb_iscntrl function from the 'mbchar' module. Or
> directly with mbrtowc() and iswcntrl().
My main concern here was \n, so that scripts can robustly parse df output.
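
For reference, the multibyte loop described above might look roughly like the
following, using mbrtowc() and iswcntrl() directly. This is only a sketch:
replace_mb_cntrl is a hypothetical name, and replacing every byte of a control
character with '?' (and masking invalid bytes one at a time) are assumptions
made for the example, not what df does.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: overwrite each control character in the
   NUL-terminated string S with '?' bytes, decoding one multibyte
   character at a time in the current LC_CTYPE locale.  The string
   keeps its length in bytes.  */
void
replace_mb_cntrl (char *s)
{
  mbstate_t state;
  size_t remaining = strlen (s);

  memset (&state, 0, sizeof state);

  while (remaining > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, remaining, &state);

      if (n == 0)
        break;                  /* NUL reached; not expected within strlen.  */

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or truncated sequence: mask one byte, reset the
             conversion state, and continue with the next byte.  */
          *s = '?';
          n = 1;
          memset (&state, 0, sizeof state);
        }
      else if (iswcntrl ((wint_t) wc))
        memset (s, '?', n);     /* Mask all bytes of the control char.  */

      s += n;
      remaining -= n;
    }
}
```

A caller would need setlocale (LC_ALL, "") beforehand for the decoding to follow
the user's locale; otherwise the "C" locale applies.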
cheers,
Pádraig.