Re: multibyte processing - handling invalid sequences (long)

coreutils

From:	Assaf Gordon
Subject:	Re: multibyte processing - handling invalid sequences (long)
Date:	Thu, 21 Jul 2016 02:04:13 -0400

Hello,

> On Jul 20, 2016, at 09:04, Eric Blake <address@hidden> wrote:
> 
> On 07/20/2016 06:21 AM, Pádraig Brady wrote:
> 
>> It's worth considering having a separate (already existing?) util
>> to fix data before processing.
[...]
> 
> [...] I wonder if we should give ALL utilities a common
> --encoding-error=POLICY option that allows runtime selection between the
> three policies, and/or an environment variable that selects the default
> policy in absence of a command line choice.

I like the idea of a dedicated program - does one thing and hopefully does it 
well...
Even if we end up implementing the policy logic in all the common utilities, 
it's a good start to discuss the policy itself.

Attached is a suggestion for such a program ('mbfix' is just a temporary name).
Usage examples below, and of course it's just a baseline for further discussion.
The program does not perform any conversion between encodings: it only checks
for valid sequences in the current locale.
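
To make "checks for valid sequences" concrete, here is a minimal sketch of the idea. It is not the attached program: that one is written in C and follows the current locale, whereas this sketch is in Python and assumes UTF-8 input. The function name `find_invalid` is hypothetical.

```python
def find_invalid(data: bytes, encoding: str = "utf-8"):
    """Return (byte_offset, octet) for the first invalid sequence, or None.

    Assumption: the input encoding is passed explicitly (UTF-8 by default)
    instead of being taken from the current locale.
    """
    try:
        data.decode(encoding)  # strict decoding raises on the first bad sequence
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


if __name__ == "__main__":
    hit = find_invalid(b"ab\xe2cdef")
    if hit is not None:
        offset, octet = hit
        print("byte %d: invalid octet 0x%02x / 0%o" % (offset, octet, octet))
```

For the input used in the examples further down (`ab\342cdef`), this reports byte offset 2 and octet 0xe2.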

I'm aware of four existing utilities that have overlapping functionality:

1. glibc's iconv: it seems the iconv that ships with glibc does not support
the --byte-subst parameter, so it can't easily be used to replace invalid
sequences.

2. GNU libiconv: the iconv from libiconv does support the "--byte-subst" and "-c"
options. However, this package does not seem to be readily available on many
GNU/Linux distributions, and anecdotally it proved tricky to install from git or
from the latest tarball.

3. ICU's uconv (http://site.icu-project.org): uconv supports several methods to
handle invalid data (called 'callbacks' in its man page). Examples:
   $ printf 'ab\342cdef' | uconv
   Conversion to Unicode from codepage failed at input byte position 2. Bytes: e2 Error: Illegal character found
   $ printf 'ab\342cdef' | uconv --callback substitute
   ab�cdef
   $ printf 'ab\342cdef' | uconv --callback escape-c
   ab\xE2cdef

However, the ICU package is large, requires C++, and brings in lots of
dependencies. This might not be desirable in all environments.

4. 'recode': from a cursory look it supports silently ignoring invalid characters,
but I found no way to transform invalid input. I'm also not sure whether this
program is still actively maintained (comments about it are welcome).

===

As such, it might be worth considering adding a dedicated program.
Below are usage examples of the attached program.

Feedback and suggestions are very welcome,
 - assaf

Valid input is printed as-is:

  $ printf 'ab\u2461cdef' | ./src/mbfix 
  ab②cdef

'--check' option allows scripting:

  $ printf 'ab\u2461cdef' | ./src/mbfix --check && echo ok
  ok

  $ printf 'ab\342cdef' | ./src/mbfix --check && echo ok || echo fail
  ./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte sequence, octet 0xe2 / 0342
  fail

encoding policies: 'discard'

  $ printf 'ab\342cdef' | ./src/mbfix --policy=discard
  abcdef

'discard' with verbose printing:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=discard --verbose > a
  ./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte sequence, octet 0xe2 / 0342

  $ cat a
  abcdef

'abort' policy stops at first invalid sequence:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=abort
  ab./src/mbfix: '(stdin)': line 1 char 2 (byte 2): found invalid multibyte sequence, octet 0xe2 / 0342

'replace' policy uses a fixed replacement character:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=replace
  ab�cdef

'replace' can use a custom unicode character:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=replace --replace-char=0x2665
  ab♥cdef

'recode' uses printf to output the invalid octet:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=recode
  ab<0xe2>cdef

'recode' with custom format:

  $ printf 'ab\342cdef' | ./src/mbfix --policy=recode \
      --recode-format="<INVALID=0x%03o>"
  ab<INVALID=0x342>cdef
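
For comparison, the four policies can be approximated with Python's codec error handlers. This is only a sketch of the policy semantics under stated assumptions, not how mbfix is implemented: it assumes UTF-8 instead of the current locale, and the names `fix`, `make_recode_handler`, `mbfix-recode` and `mbfix-recode-octal` are hypothetical.

```python
import codecs


def make_recode_handler(fmt="<0x%02x>"):
    """Build a decode error handler that formats each invalid octet with FMT."""
    def handler(err):
        if not isinstance(err, UnicodeDecodeError):
            raise err
        bad = err.object[err.start:err.end]
        # Return the replacement text and the offset at which to resume decoding.
        return "".join(fmt % octet for octet in bad), err.end
    return handler


# Hypothetical handler names, registered for this sketch only.
codecs.register_error("mbfix-recode", make_recode_handler())
codecs.register_error("mbfix-recode-octal",
                      make_recode_handler("<INVALID=0x%03o>"))

# Policy names from the examples above, mapped to Python error handlers.
POLICY_TO_HANDLER = {
    "discard": "ignore",       # drop invalid octets
    "abort": "strict",         # raise at the first invalid sequence
    "replace": "replace",      # substitute U+FFFD
    "recode": "mbfix-recode",  # print the octet using a printf-style format
}


def fix(data: bytes, policy: str = "replace") -> str:
    """Decode DATA as UTF-8 (an assumption), applying POLICY to invalid input."""
    handler = POLICY_TO_HANDLER.get(policy, policy)
    return data.decode("utf-8", errors=handler)


if __name__ == "__main__":
    for name in ("discard", "replace", "recode", "mbfix-recode-octal"):
        print(name, fix(b"ab\xe2cdef", name))
```

Note that `%03o` formats the octet in octal, so the `0x` prefix in the custom format example above is purely cosmetic: the `342` in `<INVALID=0x342>` is octal 0342, i.e. 0xe2.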

$ ./src/mbfix --help
Usage: ./src/mbfix [OPTION]... [FILE]...
Fix and adjust multibyte character in files

Mandatory arguments to long options are mandatory for short options too.
  -c, --check          validate input, no output
  -p, --policy=POLICY  invalid-input policy: discard, abort
                       replace (default), recode
      --replace-char=N
                       with 'replace' policy, use unicode character N
                       (default: 0xFFFD 'REPLACEMENT CHARACTER')
      --recode-format=FMT
                       with 'recode' policy, recode invalid octets
                       using FMT printf-format (default: '<0x%02x>')
  -v, --verbose        report location of invalid input
  -z, --zero-terminated    line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Full documentation at: <http://www.gnu.org/software/coreutils/mbfix>
or available locally via: info '(coreutils) mbfix invocation'
