Re: multibyte processing - handling invalid sequences (long)
From: Assaf Gordon
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Thu, 21 Jul 2016 23:23:22 -0400
Hello,
> On Jul 21, 2016, at 06:08, Pádraig Brady <address@hidden> wrote:
> [...]
> It seems like --normalization={NFKC,NFKD,NFC,NFD} functionality would
> also be quite cohesive in such a util.
Attached is an improved version with Unicode normalization support.
Before continuing with the rest (more tests, documentation, NEWS, etc.),
it's worth discussing whether this is the path to take, or whether we want to
add this to each individual utility instead.
Also, do we keep these options or modify them?
For example, 'uconv' uses different terminology for handling invalid sequences:
stop, skip, substitute, escape (corresponding to abort, discard, replace, recode
below).
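To make the four policies concrete, here is a rough Python sketch (not the code in the patch) that maps each policy name onto a UTF-8 decoder error handler. The function name fix_bytes and the handler name "mbfix_recode" are made up for illustration, and Python's choice of how many replacement characters to emit for a run of bad bytes may differ from what mbfix does.

```python
import codecs


def _recode_handler(err):
    """Render each undecodable octet with the '<0x%02x>' format."""
    # This sketch only deals with decoding errors.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("<0x%02x>" % b for b in bad), err.end


# Hypothetical handler name, registered once at import time.
codecs.register_error("mbfix_recode", _recode_handler)

# Policy name -> Python decoder error handler.
POLICIES = {
    "abort": "strict",         # stop at the first invalid sequence
    "discard": "ignore",       # silently drop invalid octets
    "replace": "replace",      # substitute U+FFFD
    "recode": "mbfix_recode",  # escape invalid octets as text
}


def fix_bytes(data: bytes, policy: str = "replace") -> str:
    """Decode UTF-8 input, handling invalid octets per the given policy."""
    return data.decode("utf-8", errors=POLICIES[policy])


if __name__ == "__main__":
    sample = b"ab\xffcd"
    for name in ("discard", "replace", "recode"):
        print(name, repr(fix_bytes(sample, name)))
```

With the 'abort' policy the same input raises UnicodeDecodeError, which is the closest Python analogue to exiting with an error.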
To keep the implementation simple, Unicode normalization requires a UTF-8
locale. Is this a valid requirement?
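For reference, a UTF-8 locale check can be as small as the following Python sketch. It assumes a POSIX system where locale.nl_langinfo is available, and the helper names are hypothetical; the patch itself may perform the check differently.

```python
import locale


def is_utf8_codeset(codeset: str) -> bool:
    """Accept common spellings such as 'UTF-8', 'utf8' and 'UTF_8'."""
    return codeset.replace("-", "").replace("_", "").lower() == "utf8"


def current_locale_is_utf8() -> bool:
    """Report whether the environment's LC_CTYPE locale uses UTF-8."""
    try:
        # Adopt the locale from the environment (LANG, LC_ALL, ...).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Unknown or unsupported locale: treat it as non-UTF-8.
        return False
    return is_utf8_codeset(locale.nl_langinfo(locale.CODESET))
```

A tool could refuse normalization requests when this returns False and still allow the plain invalid-sequence policies.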
And of course, what about the name?
Comments welcome,
- assaf
Example (from the book 'Unicode Explained'):
===========
$ printf '\uFB01anc\u00E9\n'
fiancé
$ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfd | od -An -tx1c
  ef  ac  81  61  6e  63  65  cc  81  0a
   ?   ? 201   a   n   c   e   ? 201  \n
$ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfc | od -An -tx1c
  ef  ac  81  61  6e  63  c3  a9  0a
   ?   ? 201   a   n   c   ?   ?  \n
$ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkd | od -An -tx1c
  66  69  61  6e  63  65  cc  81  0a
   f   i   a   n   c   e   ? 201  \n
$ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkc | od -An -tx1c
  66  69  61  6e  63  c3  a9  0a
   f   i   a   n   c   ?   ?  \n
$ ./src/mbfix --help
Usage: ./src/mbfix [OPTION]... [FILE]...
Fix and adjust multibyte character in files

Mandatory arguments to long options are mandatory for short options too.
  -A, --abort            same as --policy=abort
  -C, --recode           same as --policy=recode
  -c, --check            validate input, no output
  -D, --discard          same as --policy=discard
  -n, --normalization=NORM
                         apply unicode normalization NORM:, one of:
                         nfd, nfc, nfkd, nfkc. Normalization requires
                         UTF-8 locales.
  -p, --policy=POLICY    invalid-input policy: discard, abort
                         replace (default), recode
  -R, --replace          same as --policy=replace
      --replace-char=N
                         with 'replace' policy, use unicode character N
                         (default: 0xFFFD 'REPLACEMENT CHARACTER')
      --recode-format=FMT
                         with 'recode' policy, recode invalid octets
                         using FMT printf-format (default: '<0x%02x>')
  -v, --verbose          report location of invalid input
  -z, --zero-terminated  line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Full documentation at: <http://www.gnu.org/software/coreutils/mbfix>
or available locally via: info '(coreutils) mbfix invocation'
====
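As a cross-check of the four normalization examples in the transcript above, the same byte sequences can be reproduced with Python's standard unicodedata module. This is only a verification sketch, independent of the mbfix implementation; SAMPLE and normalized_hex are names chosen here for illustration.

```python
import unicodedata

# U+FB01 (fi ligature) + "anc" + U+00E9 (e with acute) + newline,
# the same input as the printf commands in the transcript.
SAMPLE = "\ufb01anc\u00e9\n"


def normalized_hex(text: str, form: str) -> str:
    """Normalize text to the given form and return its UTF-8 bytes as hex."""
    data = unicodedata.normalize(form.upper(), text).encode("utf-8")
    return " ".join("%02x" % b for b in data)


if __name__ == "__main__":
    for form in ("nfd", "nfc", "nfkd", "nfkc"):
        print(form, normalized_hex(SAMPLE, form))
```

The canonical forms (nfd, nfc) leave the ligature intact and only decompose or compose the accented letter, while the compatibility forms (nfkd, nfkc) also expand the ligature to "fi".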
0001-mbfix-a-new-program-to-fix-invalid-multibyte-files.patch.xz
Description: Binary data