Re: multibyte processing - handling invalid sequences (long)
From: Pádraig Brady
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Sat, 23 Jul 2016 21:30:07 +0100
On 23/07/16 19:05, Assaf Gordon wrote:
>
>> On Jul 23, 2016, at 06:51, Pádraig Brady <address@hidden> wrote:
>> I was wondering about the tool being line/record oriented.
>>
>> Disadvantages are:
>> requires arbitrarily large buffers for arbitrarily long lines
>> relatively slow in the presence of short/normal lines
>> sensitive to the current stdio buffering mode
>> requires -z option to support NUL termination
>>
>> Processing a block at a time instead avoids such issues.
>> UTF-8 at least is self synchronising, so after reading a block
>> you just have to look at the last 3 bytes to know
>> how many to append to the start of the next block.
>
> block-at-a-time would work well for detecting/fixing invalid multibyte
> sequences, especially in UTF-8.
> But I'm not sure about other multibyte encodings (I'll have to investigate).
>
> However, for unicode normalization, I am not sure there's a stream interface
> to it (gnulib's uninorm takes a whole string to normalize). IIUC,
> normalization requires being able to examine some unicode characters ahead.
Oh right I see.
You're saying that splitting per line is a natural way to ensure
you don't split processing in the middle of a decomposed character,
which is significant in normalization processing.
To support that you'd have to do something like:
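  filter = uninorm_filter_create()
  while ((n = read(fd, buf, BUFSIZE)) > 0)   /* stop on EOF (0) or error (-1) */
    for each mbchar in buf[0..n)
      uchar = mbtowchar(mbchar);             /* pseudocode: decode one character */
      if (!uchar)
        uchar = fix(mbchar);                 /* pseudocode: repair invalid sequence */
      uninorm_filter_write(filter, uchar);
  uninorm_filter_flush(filter)               /* emit whatever is still buffered */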
I don't know how that would perform compared to u8_normalize().
It might be faster, since we're already processing each char.
Or it might be slower, if u8_normalize() has some UTF-8-specific optimizations.
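For reference, the "look at the last 3 bytes" step for block-at-a-time
UTF-8 processing could look roughly like the sketch below. The function
name utf8_incomplete_tail() is made up for illustration and is not from
coreutils or gnulib. It only checks lead-byte lengths, not full validity,
and returns how many trailing bytes to carry over to the start of the
next block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return how many bytes (0..3) at the end of
   buf[0..len) begin a UTF-8 sequence that the block cuts short.
   Those bytes should be prepended to the next block. */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t max_back = len < 3 ? len : 3;
    for (size_t back = 1; back <= max_back; back++) {
        unsigned char c = buf[len - back];

        if ((c & 0xC0) == 0x80)
            continue;           /* continuation byte: keep looking for the lead */

        size_t need;            /* total length announced by the lead byte */
        if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;           /* ASCII or invalid lead: nothing to carry */

        /* Carry the bytes over only if the sequence is cut short. */
        return need > back ? back : 0;
    }

    /* Empty input, or the last 3 bytes are all continuation bytes: that is
       either the end of a complete 4-byte character or invalid input, and
       in both cases there is nothing to carry over. */
    return 0;
}
```

A reader loop would call this after each read(), process
buf[0 .. n - tail), and memmove() the remaining tail bytes to the start
of the buffer before the next read().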
cheers,
Pádraig