From: Assaf Gordon
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Thu, 21 Jul 2016 02:04:13 -0400
Hello,
> On Jul 20, 2016, at 09:04, Eric Blake <address@hidden> wrote:
>
> On 07/20/2016 06:21 AM, Pádraig Brady wrote:
>
>> It's worth considering having a separate (already existing?) util
>> to fix data before processing.
[...]
>
> [...] I wonder if we should give ALL utilities a common
> --encoding-error=POLICY option that allows runtime selection between the
> three policies, and/or an environment variable that selects the default
> policy in absence of a command line choice.
I like the idea of a dedicated program - does one thing and hopefully does it
well...
Even if we end up implementing the policy logic in all the common utilities,
it's a good start to discuss the policy itself.
Attached is a suggestion for such a program ('mbfix' is just a temporary name).
Usage examples below, and of course it's just a baseline for further discussion.
The program does not perform any conversions between encodings: it only checks
for valid sequences in the current locale.
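To make "checks only, no conversion" concrete, here is a minimal sketch in
Python (not the attached C program). The function name is_valid is
hypothetical, and using locale.getpreferredencoding() as a stand-in for
"the current locale" is my assumption:

```python
import locale
from typing import Optional


def is_valid(data: bytes, encoding: Optional[str] = None) -> bool:
    """Return True if `data` decodes cleanly; nothing is converted or written."""
    if encoding is None:
        # Assumption: the locale's preferred encoding stands in for
        # "the current locale" of the real program.
        encoding = locale.getpreferredencoding(False)
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid(b"ab\xe2\x91\xa1cdef", "utf-8"))  # U+2461 encoded as E2 91 A1
print(is_valid(b"ab\xe2cdef", "utf-8"))          # lone lead byte 0xE2
```

The decoded text is thrown away; only the success or failure of the strict
decode matters.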
I'm aware of four existing utilities that have overlapping functionality:
1. GNU glibc's iconv: the default iconv that comes with glibc does not seem to
support the --byte-subst parameter, so it can't easily be used to replace
invalid sequences.
2. GNU libiconv: iconv from libiconv does support the "--byte-subst" and "-c"
options. However, this package does not seem to be readily available on many
GNU/Linux distributions, and anecdotally proved tricky to install from git or
the latest tarball.
3. ICU's uconv (http://site.icu-project.org): uconv supports several methods to
handle invalid data (called 'callbacks' in their man page). Examples:
$ printf 'ab\342cdef' | uconv
Conversion to Unicode from codepage failed at input byte position 2. Bytes:
e2 Error: Illegal character found
$ printf 'ab\342cdef' | uconv --callback substitute
ab�cdef
$ printf 'ab\342cdef' | uconv --callback escape-c
ab\xE2cdef
However, the ICU package is large, requires C++, and brings in lots of
dependencies. This might not be desirable in all environments.
4. 'recode': from a cursory look, it supports silently ignoring invalid
characters, but I found no way to transform invalid input. I'm also not sure
whether this program is still actively maintained (comments about it are
welcome).
===
As such, it might be worth considering adding a dedicated program.
Below are usage examples of the attached program.
Feedback and suggestions are very welcome,
- assaf
Valid input is printed as-is:
$ printf 'ab\u2461cdef' | ./src/mbfix
ab②cdef
'--check' option allows scripting:
$ printf 'ab\u2461cdef' | ./src/mbfix --check && echo ok
ok
$ printf 'ab\342cdef' | ./src/mbfix --check && echo ok || echo fail
./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte
sequence, octet 0xe2 / 0342
fail
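How the reported location might be derived can be sketched as follows. This
is Python, not the attached C code. The names first_invalid and describe are
hypothetical. I am also assuming that the line number is 1-based, that the
char and byte offsets are 0-based counts from the start of the input, and
that the input is UTF-8:

```python
from typing import Optional, Tuple


def first_invalid(data: bytes, encoding: str = "utf-8") -> Optional[Tuple[int, int, int, int]]:
    """Return (line, char, byte, octet) for the first invalid octet, or None."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as err:
        prefix = data[:err.start]             # everything before the bad octet is valid
        line = prefix.count(b"\n") + 1        # assumed 1-based line number
        char = len(prefix.decode(encoding))   # characters preceding the bad octet
        return line, char, err.start, data[err.start]
    return None


def describe(data: bytes, name: str = "(stdin)") -> Optional[str]:
    """Format a diagnostic similar to the one shown above, or None if valid."""
    hit = first_invalid(data)
    if hit is None:
        return None
    line, char, byte, octet = hit
    return (f"'{name}': line {line} char {char} (byte {byte}): found invalid "
            f"multibyte sequence, octet 0x{octet:02x} / {octet:04o}")


print(describe(b"ab\xe2cdef"))
```

A --check style wrapper would then exit non-zero whenever describe() returns
a message, which is what makes the '&& echo ok || echo fail' idiom work.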
'discard' policy drops the invalid octet:
$ printf 'ab\342cdef' | ./src/mbfix --policy=discard
abcdef
'discard' with verbose printing:
$ printf 'ab\342cdef' | ./src/mbfix --policy=discard --verbose > a
./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte
sequence, octet 0xe2 / 0342
$ cat a
abcdef
'abort' policy stops at first invalid sequence:
$ printf 'ab\342cdef' | ./src/mbfix --policy=abort
ab./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte
sequence, octet 0xe2 / 0342
'replace' policy uses a fixed replacement character:
$ printf 'ab\342cdef' | ./src/mbfix --policy=replace
ab�cdef
'replace' can use a custom unicode character:
$ printf 'ab\342cdef' | ./src/mbfix --policy=replace --replace-char=0x2665
ab♥cdef
'recode' uses printf to output the invalid octet:
$ printf 'ab\342cdef' | ./src/mbfix --policy=recode
ab<0xe2>cdef
'recode' with custom format:
$ printf 'ab\342cdef' | ./src/mbfix --policy=recode \
    --recode-format="<INVALID=0x%03o>"
ab<INVALID=0x342>cdef
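The four policies shown above can be sketched in a few lines of Python. This
is only an illustration of the semantics under discussion, not the attached
implementation. The name fix_bytes is hypothetical, UTF-8 is assumed instead
of the current locale, and handling one octet at a time is my assumption
about how invalid input is consumed:

```python
POLICIES = ("discard", "abort", "replace", "recode")


def fix_bytes(data: bytes, policy: str = "replace", replace_char: int = 0xFFFD,
              recode_format: str = "<0x%02x>", encoding: str = "utf-8") -> bytes:
    """Apply an invalid-input policy to `data`, one bad octet at a time."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode(encoding))  # the rest is valid: done
            break
        except UnicodeDecodeError as err:
            bad = pos + err.start
            out.append(data[pos:bad].decode(encoding))  # keep the valid prefix
            octet = data[bad]
            if policy == "abort":
                raise ValueError(f"invalid octet 0x{octet:02x} at byte {bad}")
            if policy == "replace":
                out.append(chr(replace_char))
            elif policy == "recode":
                out.append(recode_format % octet)   # printf-style formatting
            # 'discard' appends nothing
            pos = bad + 1                            # skip the bad octet only
    return "".join(out).encode(encoding)


print(fix_bytes(b"ab\xe2cdef", "discard"))
print(fix_bytes(b"ab\xe2cdef", "recode", recode_format="<INVALID=0x%03o>"))
```

Re-decoding the tail after every bad octet is quadratic in the worst case; a
real tool would more likely scan the input once with mbrtowc(), but the loop
keeps the policy logic easy to see.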
$ ./src/mbfix --help
Usage: ./src/mbfix [OPTION]... [FILE]...
Fix and adjust multibyte character in files
Mandatory arguments to long options are mandatory for short options too.
  -c, --check              validate input, no output
  -p, --policy=POLICY      invalid-input policy: discard, abort
                             replace (default), recode
      --replace-char=N
                           with 'replace' policy, use unicode character N
                             (default: 0xFFFD 'REPLACEMENT CHARACTER')
      --recode-format=FMT
                           with 'recode' policy, recode invalid octets
                             using FMT printf-format (default: '<0x%02x>')
  -v, --verbose            report location of invalid input
  -z, --zero-terminated    line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Full documentation at: <http://www.gnu.org/software/coreutils/mbfix>
or available locally via: info '(coreutils) mbfix invocation'