Re: multibyte processing - handling invalid sequences (long)


From: Pádraig Brady
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Thu, 21 Jul 2016 11:08:44 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 21/07/16 07:04, Assaf Gordon wrote:
> Hello,
> 
>> On Jul 20, 2016, at 09:04, Eric Blake <address@hidden> wrote:
>>
>> On 07/20/2016 06:21 AM, Pádraig Brady wrote:
>>
>>> It's worth considering having a separate (already existing?) util
>>> to fix data before processing.
> [...]
>>
>> [...] I wonder if we should give ALL utilities a common
>> --encoding-error=POLICY option that allows runtime selection between the
>> three policies, and/or an environment variable that selects the default
>> policy in absence of a command line choice.
> 
> I like the idea of a dedicated program - does one thing and hopefully does it well...
> Even if we end up implementing the policy logic in all the common utilities,
> it's a good start to discuss the policy itself.
> 
> Attached is a suggestion for such a program ('mbfix' is just a temporary name).
> Usage examples are below; of course, it's just a baseline for further discussion.
> The program does not perform any conversions between encodings: it only checks
> for valid sequences in the current locale.
> 
> I'm aware of four existing utilities that have overlapping functionality:
> 
> 1.
> GNU glibc's iconv: the iconv that ships with glibc does not seem to support
> the --byte-subst parameter, so it can't easily be used to replace invalid
> sequences.
> 
> 2.
> GNU libiconv: iconv from libiconv does support the "--byte-subst" and "-c"
> options. However, this package does not seem to be readily available on many
> GNU/Linux distributions, and anecdotally proved tricky to install from git or
> the latest tarball.
> 
> 3.
> ICU's uconv (http://site.icu-project.org): uconv supports several methods to 
> handle invalid data (called 'callbacks' in their man page). Examples:
>    $ printf 'ab\342cdef' | uconv
>    Conversion to Unicode from codepage failed at input byte position 2. Bytes: e2 Error: Illegal character found
>    $ printf 'ab\342cdef' | uconv --callback substitute
>    ab�cdef
>    $ printf 'ab\342cdef' | uconv --callback escape-c
>    ab\xE2cdef
> 
> However, the ICU package is large, requires C++, and brings in lots of
> dependencies. This might not be desirable in all environments.
> 
> 4.
> 'recode' - from a cursory look it supports silently ignoring invalid
> characters, but I found no way to transform invalid input. I'm also not sure
> if this program is still actively maintained (comments about it are welcome).
> 
> 
> ===
> 
> As such, it might be worth considering adding a dedicated program.
> Below are usage examples of the attached program.
> 
> Feedback and suggestions are very welcome,
>  - assaf
> 
> 
> 
> 
> 
> Valid input is printed as-is:
> 
>   $ printf 'ab\u2461cdef' | ./src/mbfix 
>   ab②cdef
> 
> 
> '--check' option allows scripting:
> 
>   $ printf 'ab\u2461cdef' | ./src/mbfix --check && echo ok
>   ok
> 
>   $ printf 'ab\342cdef' | ./src/mbfix --check && echo ok || echo fail
>   ./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte sequence, octet 0xe2 / 0342
>   fail
> 
> 
> encoding policies: 'discard'
> 
>   $ printf 'ab\342cdef' | ./src/mbfix --policy=discard
>   abcdef
> 
> 'discard' with verbose printing:
> 
>   $ printf 'ab\342cdef' | ./src/mbfix --policy=discard --verbose > a
>   ./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte sequence, octet 0xe2 / 0342
> 
>   $ cat a
>   abcdef
> 
> 
> 'abort' policy stops at first invalid sequence:
> 
>   $ printf 'ab\342cdef' | ./src/mbfix --policy=abort
>   ab./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte sequence, octet 0xe2 / 0342
> 
> 
> 'replace' policy uses a fixed replacement character:
> 
>   $ printf 'ab\342cdef' | ./src/mbfix --policy=replace
>   ab�cdef
> 
> 'replace' can use a custom unicode character:
> 
>   $ printf 'ab\342cdef' | ./src/mbfix --policy=replace --replace-char=0x2665
>   ab♥cdef
> 
> 'recode' uses printf to output the invalid octet:
> 
>   $ printf 'ab\342cdef' | ./src/mbfix --policy=recode
>   ab<0xe2>cdef
> 
> 'recode' with custom format:
> 
>   $ printf 'ab\342cdef' | ./src/mbfix --policy=recode --recode-format="<INVALID=0x%03o>"
>   ab<INVALID=0x342>cdef
> 
> 
> $ ./src/mbfix --help
> Usage: ./src/mbfix [OPTION]... [FILE]...
> Fix and adjust multibyte character in files
> 
> Mandatory arguments to long options are mandatory for short options too.
>   -c, --check          validate input, no output
>   -p, --policy=POLICY  invalid-input policy: discard, abort
>                        replace (default), recode
>       --replace-char=N
>                        with 'replace' policy, use unicode character N
>                        (default: 0xFFFD 'REPLACEMENT CHARACTER')
>       --recode-format=FMT
>                        with 'recode' policy, recode invalid octets
>                        using FMT printf-format (default: '<0x%02x>')
>   -v, --verbose        report location of invalid input
>   -z, --zero-terminated    line delimiter is NUL, not newline
>       --help     display this help and exit
>       --version  output version information and exit
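
To make the policy discussion concrete, here is a minimal sketch of the four
policies in Python rather than C. It is not the attached mbfix code: it assumes
UTF-8 input instead of the current locale, and the names fix, check,
POLICY_HANDLERS and the "recode_octets" handler are made up for illustration.

```python
import codecs


def _recode_octets(err):
    """Error handler: render each invalid octet as <0xNN> and resume after it."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("<0x%02x>" % b for b in bad), err.end


# "recode_octets" is a hypothetical handler name registered for this sketch.
codecs.register_error("recode_octets", _recode_octets)

# Proposed policy names mapped onto Python's decode error handlers.
POLICY_HANDLERS = {
    "discard": "ignore",        # drop invalid octets
    "replace": "replace",       # substitute U+FFFD
    "recode": "recode_octets",  # print the octet value instead
    "abort": "strict",          # raise UnicodeDecodeError at the first error
}


def fix(data, policy="replace", encoding="utf-8"):
    """Decode data, handling invalid sequences according to policy.

    Unlike the 'abort' example above, which writes the valid prefix
    before failing, the strict handler here raises without output.
    """
    return data.decode(encoding, errors=POLICY_HANDLERS[policy])


def check(data, encoding="utf-8"):
    """Return None if data is valid, else the 0-based offset of the bad octet."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start
    return None
```

With the input used in the examples above, fix(b'ab\xe2cdef', 'discard') gives
'abcdef' and check(b'ab\xe2cdef') gives 2, the same byte offset that --check
reports.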

It seems like --normalization={NFKC,NFKD,NFC,NFD} functionality would
also be quite cohesive in such a util.
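
As a rough illustration of what the four forms would do, here is a sketch
using Python's standard unicodedata module. It assumes the input has already
been validated and decoded; the normalize wrapper and the FORMS tuple are
hypothetical names, and nothing here reflects how mbfix would implement it.

```python
import unicodedata

# The four Unicode normalization forms.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize(text, form="NFC"):
    """Return text converted to the requested normalization form."""
    if form not in FORMS:
        raise ValueError("unknown normalization form: %r" % (form,))
    return unicodedata.normalize(form, text)
```

For example, normalize('e\u0301', 'NFC') composes to '\u00e9', and
normalize('\u2461', 'NFKC') turns the circled digit from the examples above
into a plain '2', while NFC leaves it unchanged. The compatibility forms lose
information, which matters when choosing a default.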

A related thread:
http://lists.gnu.org/archive/html/bug-coreutils/2009-02/threads.html#00224

thanks,
Pádraig


