Re: multibyte processing - handling invalid sequences (long)
From: Pádraig Brady
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Wed, 20 Jul 2016 13:21:01 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 20/07/16 07:11, Assaf Gordon wrote:
> Hello all,
>
> I'd like to discuss a few aspects of multibyte processing in coreutils, as
> preparation for future improvements.
>
> To start with an "easy" topic: how to handle invalid input (i.e. input octets
> that result in invalid multibyte sequence).
> Previous discussion said no internal conversion to wchar_t so that invalid
> sequences can be handled as C locale (
> https://lists.gnu.org/archive/html/coreutils/2010-09/msg00051.html ).
> Pádraig's i18n plan left the handling issue open ("How do we handle invalid
> encodings; substitution, elision, leaving in place?",
> http://www.pixelbeat.org/docs/coreutils_i18n/).
>
> Is there an agreement on how to handle those?
>
> Do we want to fall back to the C locale, and does that imply going back and
> revisiting invalid octets and re-processing them as single-byte characters?
> If so, the implementation needs to keep the N octets (up to MB_CUR_MAX), and
> be able to go back and process them. Alternatively, we can treat only the
> last octet (the offending one that caused the sequence to be invalid) as a
> single-byte character, thus possibly losing data.
>
>
> One possibility is to have all programs print an informative warning to
> stderr upon the detection of the first invalid multibyte sequence, then
> resort to 'best-effort' (e.g. only the last octet, or something else that's
> easy to implement).
> My rationale is that for an input file with invalid sequences, there is no one
> correct solution that would satisfy all cases: some users would think the
> obvious correct solution is to output invalid sequences as-is, others would
> think they should be silently ignored (i.e. a program should never generate
> invalid output even on invalid input).
> The best we could do is warn them, and document a way to fix invalid files
> (along the lines of 'iconv --byte-subst="<0x%x>"'). Users could always
> fall back to forcing the C locale, and then all input bytes will be processed.
>
>
>
> To be more concrete, here are some examples:
>
> The unicode code-point U+2460 is 'CIRCLED DIGIT ONE',
> in UTF-8 octal: printf '\342\221\240'
> I'll use the invalid sequence '\342\221\300' as input below.
>
> What should be the output in the following cases:
>
> 'cut': should it print '\300' or '\342' ?
>
> printf '\342\221\300' | LC_ALL=en_US.UTF-8 cut -c1
>
>
> 'wc': should it print 1 (counting only '\300') or 3 (counting all octets) or
> 0 ?
> currently it prints 0 because it doesn't count invalid multibyte characters.
>
> printf '\342\221\300' | LC_ALL=en_US.UTF-8 wc -m
>
> similar issue, but perhaps with different logic and rationale, with "wc -L".
>
>
> 'expand': should this be expanded to '\300' + 7 spaces + 'A',
> or '\342\221\300' + 5 spaces + 'A' ? or something else ?
>
> printf '\342\221\300\tA\n' | LC_ALL=en_US.UTF-8 expand
>
>
>
> 'fold': should this print: 'aa\342\n\221\300b\n' (treating them as
> single-bytes), or
> 'aa\300\nb\n' (using only the last octet), or something else?
>
> printf 'aa\342\221\300b\n' | LC_ALL=en_US.UTF-8 fold -w 3
>
>
> 'printf' - deals only with bytes. e.g. the following should be printed as-is:
>
> env printf '%s\n' "$(env printf '\342\221\300')"
> env printf "$(env printf '\342\221\300')"
>
>
> 'fmt' and 'pr': I assume they should print the invalid sequence as is, as
> they do not break mid-words.
>
> 'head', 'tail', 'split' - not relevant as they deal with bytes, not
> characters.
>
> 'csplit': only indirectly relevant, as I seem to remember that a standard
> regex should never match an invalid multibyte sequence?
>
> 'shuf', 'paste' - not relevant as they deal with complete lines.
>
> 'yes' - prints input as-is, e.g. the following works:
>
> yes "$(env printf '\342\221\300')"
>
> 'test' - operators '-n' and '-z' work correctly with invalid sequences.
>
> 'expr': regex operations should never match (IIUC).
> for 'substr', should this return '\300' or '\342' ?
>
> LC_ALL=en_US.UTF-8 expr substr "$(printf '\342\221\300')" 1 1
>
> for 'length', should this return 3 (treating as 3 single-bytes) or 1
> (counting the last offending octet)?
>
> LC_ALL=en_US.UTF-8 expr length "$(printf '\342\221\300')"
>
> for 'index', both STRING and CHAR might be invalid. Should an invalid CHAR
> parameter be rejected outright ?
>
> 'numfmt' - as long as it doesn't get confused with a digit character, invalid
> sequences should be printed 'as-is'.
>
> 'seq' - doesn't take any input.
>
> 'date' - should print invalid characters in format string as-is.
>
>
> For now I'm going to side-step sort+join+uniq, as I think they present a more
> complicated set of issues when it comes to multibyte processing.
>
>
> Comments are very welcome,
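
To make the "re-process the invalid octets as single-byte characters" fallback concrete, here is a minimal Python sketch (not coreutils code). It assumes UTF-8 input and leans on Python's `surrogateescape` error handler, so that every octet that cannot be decoded becomes its own one-octet unit. The names `split_units` and `fold` are hypothetical, and this `fold` counts units rather than display columns.

```python
from typing import List


def split_units(data: bytes) -> List[bytes]:
    """Split data into one bytes object per character.

    Valid UTF-8 characters stay whole; each undecodable octet is kept
    as its own single-byte unit (the "fall back to C locale" idea).
    """
    # surrogateescape maps every undecodable byte to one lone surrogate,
    # and maps it back to the original byte when encoding again.
    text = data.decode("utf-8", errors="surrogateescape")
    return [ch.encode("utf-8", errors="surrogateescape") for ch in text]


def fold(line: bytes, width: int) -> bytes:
    """Simplified fold: break after every `width` units (not columns)."""
    units = split_units(line)
    rows = [b"".join(units[i:i + width]) for i in range(0, len(units), width)]
    return b"\n".join(rows) + b"\n"
```

Under these assumptions the fold example quoted above comes out as 'aa\342\n\221\300b\n', and a `wc -m` style count of '\342\221\300' would be 3, since no octet is lost.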
It's worth considering having a separate (already existing?) util
to fix data before processing. That could have options to:
drop invalid chars, replace them with a replacement char,
apply the various normalization forms
(http://unicode.org/reports/tr15/#Norm_Forms),
convert enclosed forms like ㊷ to 42, etc.

I.e. we should avoid complicating each util where possible,
and at least avoid having options on each util that could be
hoisted to a more general util like the above.

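As a rough illustration of what such a standalone filter might do, here is a Python sketch. The function name `fix_utf8` and its parameters are made up for this example and do not correspond to any existing util. It assumes UTF-8 input and uses Python's built-in decoder error handlers plus `unicodedata.normalize`.

```python
import unicodedata
from typing import Optional


def fix_utf8(data: bytes, invalid: str = "replace",
             form: Optional[str] = None) -> str:
    """Hypothetical pre-filter that cleans UTF-8 data before other tools run.

    invalid: "drop" elides invalid octets, "replace" substitutes U+FFFD.
    form:    optional normalization form: "NFC", "NFD", "NFKC" or "NFKD".
    """
    # Map the option names used here onto Python's codec error handlers.
    handlers = {"drop": "ignore", "replace": "replace"}
    text = data.decode("utf-8", errors=handlers[invalid])
    if form is not None:
        # The compatibility forms (NFKC/NFKD) also unwrap enclosed forms.
        text = unicodedata.normalize(form, text)
    return text
```

For instance, `fix_utf8("㊷".encode("utf-8"), form="NFKC")` gives "42", and `fix_utf8(b"aa\xe2\x91\xc0b", invalid="drop")` gives "aab". Keeping these choices in one filter means the individual utils would not each need such options.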
Silently dropping invalid characters probably isn't a great idea,
and warnings to stderr are a bit messy and could be seen to contradict
POSIX, which suggests exiting with failure if anything is output to stderr.

A compromise might be to just replace invalid chars with
the replacement character � (U+FFFD) and then include that in
normal character processing, to make issues in the input apparent.

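A small Python sketch of that compromise, assuming UTF-8 input; the helper names are hypothetical. Invalid input is decoded with U+FFFD substituted, and the result is then counted and sliced like any other text. How many U+FFFD characters one bad sequence produces depends on the decoder, so a real implementation would have to pin that down.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_visible(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD wherever the input is invalid."""
    return data.decode("utf-8", errors="replace")


def count_chars(data: bytes) -> int:
    """Rough analogue of 'wc -m': U+FFFD counts as an ordinary character."""
    return len(decode_visible(data))


def first_char(data: bytes) -> str:
    """Rough analogue of 'cut -c1' on a single line."""
    return decode_visible(data)[:1]


def has_invalid(data: bytes) -> bool:
    """True if the decoded text contains U+FFFD.

    Note: this is also true when the input legitimately contains U+FFFD.
    """
    return REPLACEMENT in decode_visible(data)
```

With this approach `first_char(b"\xe2\x91\xc0")` is U+FFFD rather than either raw octet, so the problem shows up in the output instead of being silently dropped.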
cheers,
Pádraig