coreutils

Re: multibyte processing - handling invalid sequences (long)


From: Eric Blake
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Thu, 28 Jul 2016 11:18:45 -0600

On 07/20/2016 07:04 AM, Eric Blake wrote:
> On 07/20/2016 06:21 AM, Pádraig Brady wrote:
> 
>> It's worth considering having a separate (already existing?) util
>> to fix data before processing. That could have options to:
>>   drop invalid chars, replace with replacement char,
>>   apply various http://unicode.org/reports/tr15/#Norm_Forms,
>>   convert enclosed forms like ㊷ to 42 etc.
>> I.e. we should avoid complicating each util where possible,
>> and at least avoid having options on each util that could be
>> hoisted to a more general util like above.
>>
>> Silently dropping invalid characters probably isn't a great idea,
>> and warnings to stderr are a bit messy and could be seen to contradict
>> POSIX, which suggests exiting with failure if anything is output to stderr.
>> A compromise might be to just replace invalid chars with
>> the replacement character � and then include that in
>> normal character processing, to make issues in input apparent.
> 
> Since there are several plausible error-handling methods (silently
> discard invalid input, flag input as invalid with an error and no
> further output, convert invalid input into replacement character and
> proceed with output), all of which can be considered desirable in some
> circumstances, I wonder if we should give ALL utilities a common
> --encoding-error=POLICY option that allows runtime selection between the
> three policies, and/or an environment variable that selects the default
> policy in absence of a command line choice.
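
To make the three policies concrete, here is a minimal sketch in Python
(not coreutils code). It maps each policy onto one of Python's built-in
codec error handlers. The policy names "discard", "fail" and "replace",
the helper decode_with_policy(), and the environment variable
ENCODING_ERROR_POLICY are all hypothetical names chosen for illustration;
nothing in this thread has settled on spellings.

```python
import os
from typing import Optional

# Hypothetical policy names mapped onto Python's built-in codec error handlers.
POLICIES = {
    "discard": "ignore",   # silently drop invalid bytes
    "fail": "strict",      # raise an error, produce no further output
    "replace": "replace",  # substitute U+FFFD and keep going
}


def decode_with_policy(data: bytes, policy: Optional[str] = None) -> str:
    """Decode UTF-8 input, handling invalid sequences per the chosen policy.

    If no policy is given, fall back to the (hypothetical) environment
    variable ENCODING_ERROR_POLICY, and then to "replace".
    """
    if policy is None:
        policy = os.environ.get("ENCODING_ERROR_POLICY", "replace")
    return data.decode("utf-8", errors=POLICIES[policy])
```

With the input b"ab\xffcd", "discard" yields "abcd", "replace" yields
"ab" + U+FFFD + "cd", and "fail" raises UnicodeDecodeError. A command
line option would simply pass the policy explicitly and so override the
environment default.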

Interestingly enough, today's POSIX phone call started discussions on
how iconv() needs to be enhanced to support multiple error handling modes:

http://austingroupbugs.net/bug_view_page.php?bug_id=1007
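
For comparison, the separate clean-up pass suggested earlier in the
thread (replace invalid sequences, then normalize) might look roughly
like the following Python sketch. The function name sanitize() is
hypothetical, and the choice of NFKC is an assumption on my part; it is
the normalization form that folds compatibility characters such as the
circled 42 into plain digits.

```python
import unicodedata


def sanitize(data: bytes) -> str:
    """Hypothetical pre-processing filter, not an existing utility.

    Step 1: decode as UTF-8, turning each invalid sequence into U+FFFD
            so that problems in the input stay visible downstream.
    Step 2: apply NFKC normalization, which also maps compatibility
            characters (for example enclosed forms) to plain equivalents.
    """
    text = data.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFKC", text)
```

Here sanitize("㊷".encode("utf-8")) returns "42", and invalid bytes come
out as the replacement character, so later tools only ever see valid
text.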


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
