Re: multibyte processing - handling invalid sequences (long)
From: Pádraig Brady
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Thu, 21 Jul 2016 11:08:44 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0
On 21/07/16 07:04, Assaf Gordon wrote:
> Hello,
>
>> On Jul 20, 2016, at 09:04, Eric Blake <address@hidden> wrote:
>>
>> On 07/20/2016 06:21 AM, Pádraig Brady wrote:
>>
>>> It's worth considering having a separate (already existing?) util
>>> to fix data before processing.
> [...]
>>
>> [...] I wonder if we should give ALL utilities a common
>> --encoding-error=POLICY option that allows runtime selection between the
>> three policies, and/or an environment variable that selects the default
>> policy in absence of a command line choice.
>
> I like the idea of a dedicated program - does one thing and hopefully does it
> well...
> Even if we end up implementing the policy logic in all the common utilities,
> it's a good start to discuss the policy itself.
>
> Attached is a suggestion for such a program ('mbfix' is just a temporary
> name).
> Usage examples below, and of course it's just a baseline for further
> discussion.
> The program does not perform any conversions between encodings: it only checks
> for valid sequences in the current locale.
>
> I'm aware of four existing utilities that have overlapping functionality:
>
> 1.
> gnu glibc's iconv: seems like the default iconv that comes with the default
> glibc does not support the --byte-subst parameter, so can't be easily used to
> replace invalid sequences.
>
> 2.
> gnu's libiconv: iconv from libiconv does support "--byte-subst" and "-c"
> options. However, it seems this package is not readily available on many
> gnu/linux distributions, and anecdotally proved tricky to install from git or
> latest tarball.
>
> 3.
> ICU's uconv (http://site.icu-project.org): uconv supports several methods to
> handle invalid data (called 'callbacks' in their man page). Examples:
> $ printf 'ab\342cdef' | uconv
> Conversion to Unicode from codepage failed at input byte position 2.
> Bytes: e2 Error: Illegal character found
> $ printf 'ab\342cdef' | uconv --callback substitute
> ab�cdef
> $ printf 'ab\342cdef' | uconv --callback escape-c
> ab\xE2cdef
>
> However, the ICU package is large, requires C++, and brings in lots of
> dependencies. This might not be desirable in all environments.
>
> 4.
> 'recode' - from a cursory look it supports silently ignoring invalid
> characters, but I found no way to transform invalid input. I'm also not sure
> if this program is still actively maintained (comments about it are welcome).
>
>
> ===
>
> As such, it might be worth considering adding a dedicated program.
> Below are usage examples of the attached program.
>
> Feedback and suggestions are very welcome,
> - assaf
>
>
>
>
>
> Valid input is printed as-is:
>
> $ printf 'ab\u2461cdef' | ./src/mbfix
> ab②cdef
>
>
> '--check' option allows scripting:
>
> $ printf 'ab\u2461cdef' | ./src/mbfix --check && echo ok
> ok
>
> $ printf 'ab\342cdef' | ./src/mbfix --check && echo ok || echo fail
> ./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte
> sequence, octet 0xe2 / 0342
> fail
>
>
> encoding policies: 'discard'
>
> $ printf 'ab\342cdef' | ./src/mbfix --policy=discard
> abcdef
>
> 'discard' with verbose printing:
>
> $ printf 'ab\342cdef' | ./src/mbfix --policy=discard --verbose > a
> ./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte
> sequence, octet 0xe2 / 0342
>
> $ cat a
> abcdef
>
>
> 'abort' policy stops at first invalid sequence:
>
> $ printf 'ab\342cdef' | ./src/mbfix --policy=abort
> ab./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte
> sequence, octet 0xe2 / 0342
>
>
> 'replace' policy uses a fixed replacement character:
>
> $ printf 'ab\342cdef' | ./src/mbfix --policy=replace
> ab�cdef
>
> 'replace' can use a custom unicode character:
>
> $ printf 'ab\342cdef' | ./src/mbfix --policy=replace --replace-char=0x2665
> ab♥cdef
>
> 'recode' uses printf to output the invalid octet:
>
> $ printf 'ab\342cdef' | ./src/mbfix --policy=recode
> ab<0xe2>cdef
>
> 'recode' with custom format:
>
> $ printf 'ab\342cdef' | ./src/mbfix --policy=recode \
>     --recode-format="<INVALID=0x%03o>"
> ab<INVALID=0x342>cdef
>
>
> $ ./src/mbfix --help
> Usage: ./src/mbfix [OPTION]... [FILE]...
> Fix and adjust multibyte character in files
>
> Mandatory arguments to long options are mandatory for short options too.
>   -c, --check              validate input, no output
>   -p, --policy=POLICY      invalid-input policy: discard, abort
>                              replace (default), recode
>       --replace-char=N
>                            with 'replace' policy, use unicode character N
>                              (default: 0xFFFD 'REPLACEMENT CHARACTER')
>       --recode-format=FMT
>                            with 'recode' policy, recode invalid octets
>                              using FMT printf-format (default: '<0x%02x>')
>   -v, --verbose            report location of invalid input
>   -z, --zero-terminated    line delimiter is NUL, not newline
>       --help     display this help and exit
>       --version  output version information and exit
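For anyone who wants to experiment with the four policies quoted above without
building the attached program, here is a minimal Python sketch. It is not the
mbfix implementation: it assumes UTF-8 input rather than the current locale's
encoding, and the names fix_bytes and "mbfix_sketch" are made up for
illustration. The defaults mirror the ones listed in the help text.

```python
import codecs


def fix_bytes(data, policy="replace", replace_char=0xFFFD,
              recode_format="<0x%02x>"):
    """Decode UTF-8 bytes, treating invalid octets per the named policy.

    Hypothetical helper: approximates the discard/abort/replace/recode
    policies described in the thread; assumes UTF-8, not the locale encoding.
    """
    if policy == "abort":
        # Strict decoding raises UnicodeDecodeError at the first bad octet.
        return data.decode("utf-8", errors="strict")
    if policy == "discard":
        render = lambda octet: ""
    elif policy == "replace":
        render = lambda octet: chr(replace_char)
    elif policy == "recode":
        render = lambda octet: recode_format % octet
    else:
        raise ValueError("unknown policy: %r" % (policy,))

    def handler(exc):
        # Called by the decoder for each invalid span; returns the
        # substitute text and the offset at which decoding resumes.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        return "".join(render(octet) for octet in bad), exc.end

    # Re-registering under the same name replaces the previous handler.
    codecs.register_error("mbfix_sketch", handler)
    return data.decode("utf-8", errors="mbfix_sketch")


if __name__ == "__main__":
    sample = b"ab\xe2cdef"
    for name in ("discard", "replace", "recode"):
        print(name, fix_bytes(sample, policy=name))
```

Each invalid octet is rendered separately here, so a truncated multi-octet
sequence yields one substitute per octet; whether the real tool should do the
same is part of the policy discussion.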
It seems like --normalization={NFKC,NFKD,NFC,NFD} functionality would
also be quite cohesive in such a util.
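To make the normalization idea concrete, here is a small sketch using Python's
standard unicodedata module. The --normalization option is only a suggestion at
this point, and the normalize_text wrapper is a hypothetical name; the four
forms are the standard Unicode ones (NFC, NFD, NFKC, NFKD).

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_text(text, form="NFC"):
    """Return text in the requested Unicode normalization form.

    Hypothetical wrapper around unicodedata.normalize; rejects unknown forms.
    """
    if form not in FORMS:
        raise ValueError("unknown normalization form: %r" % (form,))
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
    for form in FORMS:
        result = normalize_text(decomposed, form)
        print(form, [hex(ord(ch)) for ch in result])
```

Note that the compatibility forms (NFKC, NFKD) are lossy: for example they map
U+2461 CIRCLED DIGIT TWO to a plain "2", so they would need to be opt-in.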
A related thread:
http://lists.gnu.org/archive/html/bug-coreutils/2009-02/threads.html#00224
thanks,
Pádraig