Re: multibyte processing - handling invalid sequences (long)
From: Eric Blake
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Thu, 28 Jul 2016 11:18:45 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0
On 07/20/2016 07:04 AM, Eric Blake wrote:
> On 07/20/2016 06:21 AM, Pádraig Brady wrote:
>
>> It's worth considering having a separate (already existing?) util
>> to fix data before processing. That could have options to:
>> drop invalid chars, replace with replacement char,
>> apply various http://unicode.org/reports/tr15/#Norm_Forms,
>> convert enclosed forms like ㊷ to 42 etc.
>> I.E. we should avoid complicating each util where possible,
>> and at least avoid having options on each util that could be
>> hoisted to a more general util like above.
>>
>> Silently dropping invalid characters probably isn't a great idea,
>> and warnings to stderr is a bit messy and could be seen to contradict
>> POSIX which suggests exiting with failure if anything output to stderr.
>> A compromise might be to just replace invalid chars with
>> the replacement character � and then include that in
>> normal character processing, to make issues in input apparent.
>
> Since there are several plausible error-handling methods (silently
> discard invalid input, flag input as invalid with an error and no
> further output, convert invalid input into replacement character and
> proceed with output), all of which can be considered desirable in some
> circumstances, I wonder if we should give ALL utilities a common
> --encoding-error=POLICY option that allows runtime selection between the
> three policies, and/or an environment variable that selects the default
> policy in absence of a command line choice.
Interestingly enough, today's POSIX phone call started discussions on
how iconv() needs to be enhanced to support multiple error handling modes:
http://austingroupbugs.net/bug_view_page.php?bug_id=1007
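
For concreteness, the three policies quoted above (discard, fail, replace) can be sketched in a few lines. This is only an illustration, not a proposal for how coreutils or iconv() would implement it: it uses Python rather than C for brevity, assumes UTF-8 input, and the policy names "discard", "fail" and "replace" as well as the function name are made up here. It simply maps each policy onto one of Python's standard codec error handlers.

```python
def decode_with_policy(data: bytes, policy: str) -> str:
    """Decode UTF-8 bytes under one of three hypothetical error policies."""
    # Hypothetical policy names mapped onto Python's built-in error handlers:
    #   discard -> "ignore"  (silently drop invalid bytes)
    #   fail    -> "strict"  (raise UnicodeDecodeError, produce no output)
    #   replace -> "replace" (substitute U+FFFD and keep going)
    handlers = {"discard": "ignore", "fail": "strict", "replace": "replace"}
    if policy not in handlers:
        raise ValueError(f"unknown policy: {policy!r}")
    return data.decode("utf-8", errors=handlers[policy])


if __name__ == "__main__":
    sample = b"ab\xffcd"  # 0xFF can never appear in valid UTF-8
    print(decode_with_policy(sample, "discard"))
    print(decode_with_policy(sample, "replace"))
    try:
        decode_with_policy(sample, "fail")
    except UnicodeDecodeError as err:
        print("invalid input:", err.reason)
```

With a single selector like this, the choice of policy stays in one place, which is roughly what a common --encoding-error=POLICY option or an environment variable default would give each utility.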
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org