Re: multibyte processing - handling invalid sequences (long)

From: Assaf Gordon
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Thu, 21 Jul 2016 02:04:13 -0400


> On Jul 20, 2016, at 09:04, Eric Blake <address@hidden> wrote:
> On 07/20/2016 06:21 AM, Pádraig Brady wrote:
>> It's worth considering having a separate (already existing?) util
>> to fix data before processing.
> [...] I wonder if we should give ALL utilities a common
> --encoding-error=POLICY option that allows runtime selection between the
> three policies, and/or an environment variable that selects the default
> policy in absence of a command line choice.

I like the idea of a dedicated program: it does one thing and hopefully does it
well.
Even if we end up implementing the policy logic in all the common utilities,
it's a good starting point for discussing the policy itself.

Attached is a suggestion for such a program ('mbfix' is just a temporary name).
Usage examples below, and of course it's just a baseline for further discussion.
The program does not perform any conversion between encodings: it only checks
for valid sequences in the current locale.

I'm aware of four existing utilities that have overlapping functionality:

glibc's iconv: the iconv that ships with a default glibc does not seem to
support the --byte-subst parameter, so it can't easily be used to replace
invalid sequences.

GNU libiconv: the iconv from libiconv does support the "--byte-subst" and "-c"
options. However, this package does not seem to be readily available on many
GNU/Linux distributions, and anecdotally it proved tricky to install from git
or the latest tarball.

ICU's uconv: uconv supports several methods to handle invalid data (called
'callbacks' in its man page). Examples:
   $ printf 'ab\342cdef' | uconv
   Conversion to Unicode from codepage failed at input byte position 2. Bytes: 
e2 Error: Illegal character found
   $ printf 'ab\342cdef' | uconv --callback substitute
   $ printf 'ab\342cdef' | uconv --callback escape-c

However, the ICU package is large, requires C++, and brings in lots of
dependencies. This might not be desirable in all environments.

'recode': from a cursory look, it supports silently ignoring invalid
characters, but I found no way to transform invalid input. I'm also not sure
whether this program is still actively maintained (comments about it are
welcome).


As such, it might be worth considering adding a dedicated program.
Below are usage examples of the attached program.

Feedback and suggestions are very welcome,
 - assaf

Valid input is printed as-is:

  $ printf 'ab\u2461cdef' | ./src/mbfix 

'--check' option allows scripting:

  $ printf 'ab\u2461cdef' | ./src/mbfix --check && echo ok

  $ printf 'ab\342cdef' | ./src/mbfix --check && echo ok || echo fail
  ./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte 
sequence, octet 0xe2 / 0342
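
To illustrate what a '--check' style diagnostic has to compute, here is a
minimal Python sketch. It is not the attached program. It assumes UTF-8 input
rather than the current locale's encoding, and the function name
'first_invalid' is made up for the example. It also assumes that the line
number is 1-based, that 'char' is counted from 0 within the line, and that
'byte' is a 0-based offset from the start of the input. These conventions match
the message above for this input but are otherwise guesses.

```python
def first_invalid(data: bytes):
    """Locate the first invalid UTF-8 sequence in data.

    Returns (line, char, byte, octet), or None if data is valid UTF-8.
    Hypothetical helper; the counting conventions are assumptions.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad sequence
    except UnicodeDecodeError as err:
        # Everything before err.start decoded cleanly, so this cannot fail.
        valid = data[:err.start].decode("utf-8")
        line = valid.count("\n") + 1                 # 1-based line number
        char = len(valid) - (valid.rfind("\n") + 1)  # characters since line start
        return line, char, err.start, data[err.start]
    return None


def describe(loc) -> str:
    """Format a location in the style of the diagnostic shown above."""
    line, char, byte, octet = loc
    return (f"line {line} char {char} (byte {byte}): found invalid multibyte "
            f"sequence, octet 0x{octet:02x} / 0{octet:o}")
```

Calling describe(first_invalid(b'ab\xe2cdef')) yields the same location and
octet text as the message above, and a script can treat a None result as
success.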

encoding policies: 'discard'

  $ printf 'ab\342cdef' | ./src/mbfix --policy=discard

'discard' with verbose printing:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=discard --verbose > a
  ./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte 
sequence, octet 0xe2 / 0342

  $ cat a

'abort' policy stops at first invalid sequence:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=abort
  ab./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte 
sequence, octet 0xe2 / 0342

'replace' policy uses a fixed replacement character:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=replace

'replace' can use a custom unicode character:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=replace --replace-char=0x2665

'recode' uses printf to output the invalid octet:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=recode

'recode' with custom format:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=recode 
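
For discussion purposes, the four policies can be sketched in a few lines of
Python. Again, this is not the attached program: it assumes UTF-8 input instead
of the current locale, the name 'fix_invalid' and its parameters are invented
for the example, and it assumes each invalid octet is handled separately (one
replacement character or one formatted string per octet).

```python
def fix_invalid(data: bytes, out, policy: str = "replace",
                replace_char: int = 0xFFFD, fmt: str = "<0x%02x>") -> bool:
    """Write data to the text stream out, applying policy to invalid UTF-8.

    Returns True if the whole input was valid. With policy 'abort', the valid
    prefix is written and ValueError is raised at the first invalid octet.
    Hypothetical sketch; names and defaults mirror the examples above.
    """
    if policy not in ("discard", "abort", "replace", "recode"):
        raise ValueError(f"unknown policy: {policy}")
    clean = True
    pos = 0
    while pos < len(data):
        try:
            text = data[pos:].decode("utf-8")  # strict: raises on bad input
        except UnicodeDecodeError as err:
            clean = False
            # err.start and err.end are relative to the slice data[pos:].
            out.write(data[pos:pos + err.start].decode("utf-8"))
            bad = data[pos + err.start:pos + err.end]
            if policy == "abort":
                raise ValueError(
                    f"invalid multibyte sequence at byte {pos + err.start}"
                ) from None
            if policy == "replace":
                out.write(chr(replace_char) * len(bad))
            elif policy == "recode":
                out.write("".join(fmt % octet for octet in bad))
            # 'discard' writes nothing for the invalid octets.
            pos += err.end
        else:
            out.write(text)
            break
    return clean
```

With an io.StringIO as the output stream, the input b'ab\xe2cdef' gives
'abcdef' for 'discard', 'ab<0xe2>cdef' for 'recode' with the default format,
and 'ab' followed by U+FFFD and 'cdef' for 'replace'. Re-decoding the remaining
slice after every error is simple but not efficient. A real implementation
would more likely walk the buffer once with mbrtowc().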

$ ./src/mbfix --help
Usage: ./src/mbfix [OPTION]... [FILE]...
Fix and adjust multibyte character in files

Mandatory arguments to long options are mandatory for short options too.
  -c, --check          validate input, no output
  -p, --policy=POLICY  invalid-input policy: discard, abort
                       replace (default), recode
                       with 'replace' policy, use unicode character N
                       (default: 0xFFFD 'REPLACEMENT CHARACTER')
                       with 'recode' policy, recode invalid octets
                       using FMT printf-format (default: '<0x%02x>')
  -v, --verbose        report location of invalid input
  -z, --zero-terminated    line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <>
Full documentation at: <>
or available locally via: info '(coreutils) mbfix invocation'
