Re: multibyte processing - handling invalid sequences (long)
From: Pádraig Brady
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Sat, 23 Jul 2016 11:51:31 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 22/07/16 04:23, Assaf Gordon wrote:
> Hello,
>
>> On Jul 21, 2016, at 06:08, Pádraig Brady <address@hidden> wrote:
>> [...]
>> It seems like --normalization={NFKC,NFKD,NFC,NFD} functionality would
>> also be quite cohesive in such a util.
>
> Attached an improved version with unicode normalization support.
>
> Before continuing with other stuff (e.g. more tests, documentation, news, etc.),
> it's worth discussing if this is the path to take (or if we want to add this
> to each individual utility).
> Also, do we keep these options or modify them?
> e.g. 'uconv' uses different terminology for handling invalid sequences: stop,
> skip, substitute, escape (corresponding to abort, discard, replace, recode below).
>
> To keep the implementation simple, unicode normalization requires UTF-8
> locales - is this a valid requirement?
>
> And of course, what about the name?
>
> Comments welcomed,
> - assaf
>
> Example (from 'Unicode Explained' book):
> ===========
> $ printf '\uFB01anc\u00E9\n'
> fiancé
>
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfd | od -An -tx1c
>   ef  ac  81  61  6e  63  65  cc  81  0a
>    ?   ? 201   a   n   c   e   ? 201  \n
>
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfc | od -An -tx1c
>   ef  ac  81  61  6e  63  c3  a9  0a
>    ?   ? 201   a   n   c   ?   ?  \n
>
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkd | od -An -tx1c
>   66  69  61  6e  63  65  cc  81  0a
>    f   i   a   n   c   e   ? 201  \n
>
> $ printf '\uFB01anc\u00E9\n' | ./src/mbfix -n nfkc | od -An -tx1c
>   66  69  61  6e  63  c3  a9  0a
>    f   i   a   n   c   ?   ?  \n
>
> $ ./src/mbfix --help
> Usage: ./src/mbfix [OPTION]... [FILE]...
> Fix and adjust multibyte character in files
>
> Mandatory arguments to long options are mandatory for short options too.
>   -A, --abort            same as --policy=abort
>   -C, --recode           same as --policy=recode
>   -c, --check            validate input, no output
>   -D, --discard          same as --policy=discard
>   -n, --normalization=NORM
>                          apply unicode normalization NORM:, one of:
>                          nfd, nfc, nfkd, nfkc. Normalization requires
>                          UTF-8 locales.
>   -p, --policy=POLICY    invalid-input policy: discard, abort
>                          replace (default), recode
>   -R, --replace          same as --policy=replace
>       --replace-char=N
>                          with 'replace' policy, use unicode character N
>                          (default: 0xFFFD 'REPLACEMENT CHARACTER')
>       --recode-format=FMT
>                          with 'recode' policy, recode invalid octets
>                          using FMT printf-format (default: '<0x%02x>')
>   -v, --verbose          report location of invalid input
>   -z, --zero-terminated  line delimiter is NUL, not newline
>       --help     display this help and exit
>       --version  output version information and exit
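
As a cross-check of the quoted 'fiancé' example, the four normalization forms can be reproduced outside the tool. This is only a sketch using Python's standard unicodedata module, not the mbfix implementation; the helper name utf8_hex is made up for illustration, and the trailing newline from the quoted commands is left out.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI, then "anc", then U+00E9 (e with acute).
text = "\uFB01anc\u00E9"

def utf8_hex(s: str) -> str:
    """Hypothetical helper: UTF-8 bytes of s as space-separated hex."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))

# Canonical forms (NFD, NFC) keep the ligature; compatibility forms
# (NFKD, NFKC) expand it to "f" + "i".
results = {
    form: utf8_hex(unicodedata.normalize(form, text))
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

for form, hex_bytes in results.items():
    print(f"{form:5} {hex_bytes}")
```

The hex bytes agree with the od output quoted above, minus the final 0a.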

I was wondering about the tool being line/record oriented.
Disadvantages of that approach are:
  requires arbitrarily large buffers for arbitrarily long lines
  relatively slow in the presence of short/normal lines
  sensitive to the current stdio buffering mode
  requires the -z option to support NUL termination

Processing a block at a time instead avoids these issues.
UTF-8 at least is self-synchronising, so after reading a block
you just have to look at its last 3 bytes to know how many bytes
of an incomplete trailing sequence to carry over and prepend
to the next block.

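
To make the block-at-a-time idea concrete, here is a rough sketch in Python rather than the tool's C. The names incomplete_tail_len and iter_decoded_blocks are invented for illustration, the 'replace' error policy is just one arbitrary choice, and validation of lead bytes is left to the decoder. It only shows the carry-over step, assuming UTF-8 input.

```python
import io

def incomplete_tail_len(block: bytes) -> int:
    """Return how many trailing bytes (0 to 3) begin a UTF-8 sequence
    that the block cuts short. Only the last 3 bytes are inspected."""
    for back in range(1, min(3, len(block)) + 1):
        byte = block[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte: keep scanning back for the lead
        if 0xF0 <= byte <= 0xF7:
            need = 4
        elif 0xE0 <= byte <= 0xEF:
            need = 3
        elif 0xC0 <= byte <= 0xDF:
            need = 2
        else:
            need = 1  # ASCII, or an invalid lead left for the decoder
        return back if back < need else 0
    # Empty block, or three continuation bytes in a row: nothing to carry.
    return 0

def iter_decoded_blocks(stream, block_size=65536):
    """Yield decoded text one block at a time, carrying an incomplete
    trailing sequence over to the start of the next block."""
    carry = b""
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            break
        block = carry + chunk
        cut = len(block) - incomplete_tail_len(block)
        carry = block[cut:]
        yield block[:cut].decode("utf-8", errors="replace")
    if carry:
        # Input ended in the middle of a sequence.
        yield carry.decode("utf-8", errors="replace")

data = "\uFB01anc\u00E9\n".encode("utf-8")
# A 2-byte block size splits the 3-byte ligature across two reads.
decoded = "".join(iter_decoded_blocks(io.BytesIO(data), block_size=2))
```

Memory use is bounded by the block size plus at most 3 carried bytes, regardless of line length, and no record delimiter is involved.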
Pádraig.