Re: multibyte processing - handling invalid sequences (long)
From: Assaf Gordon
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Sat, 23 Jul 2016 14:05:25 -0400
> On Jul 23, 2016, at 06:51, Pádraig Brady <address@hidden> wrote:
> I was wondering about the tool being line/record oriented.
>
> Disadvantages are:
> requires arbitrarily large buffers for arbitrarily long lines
> relatively slow in the presence of short/normal lines
> sensitive to the current stdio buffering mode
> requires -z option to support NUL termination
>
> Processing a block at a time instead avoids such issues.
> UTF-8 at least is self synchronising, so after reading a block
> you just have to look at the last 3 bytes to know
> how many to append to the start of the next block.
Block-at-a-time processing would work well for detecting and fixing invalid
multibyte sequences, especially in UTF-8.
I'm not sure about other multibyte encodings, though (I'll have to investigate).
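To make the UTF-8 case concrete, here is a rough sketch (in Python for brevity, not the tool's actual code) of carrying an incomplete trailing sequence over to the next block. The names `incomplete_tail_len` and `decode_blocks` are hypothetical, and decoding with `errors="replace"` merely stands in for whatever the tool would do with invalid bytes.

```python
def incomplete_tail_len(block: bytes) -> int:
    """Return how many trailing bytes (0 to 3) begin a UTF-8 sequence
    that is cut off by the end of the block."""
    n = len(block)
    # The longest UTF-8 sequence is 4 bytes, so a truncated one
    # contributes at most 3 bytes to the end of the block.
    for back in range(1, min(3, n) + 1):
        b = block[n - back]
        if b & 0xC0 == 0x80:
            continue  # continuation byte: keep looking for the lead byte
        if 0xC0 <= b < 0xE0:
            need = 2
        elif 0xE0 <= b < 0xF0:
            need = 3
        elif 0xF0 <= b < 0xF8:
            need = 4
        else:
            need = 1  # ASCII or an invalid lead byte: nothing to carry
        return back if need > back else 0
    # Only continuation bytes seen (or empty block): nothing to carry.
    return 0


def decode_blocks(blocks):
    """Yield decoded text per block, prepending the carried-over bytes."""
    carry = b""
    for block in blocks:
        data = carry + block
        k = incomplete_tail_len(data)
        cut = len(data) - k
        carry = data[cut:]
        yield data[:cut].decode("utf-8", errors="replace")
    if carry:
        # Input ended in the middle of a sequence.
        yield carry.decode("utf-8", errors="replace")
```

Only the lead byte decides how many bytes are carried over; whether the completed sequence is actually valid is left to the decoding step.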
For Unicode normalization, however, I'm not sure a stream interface exists
(gnulib's uninorm takes a whole string to normalize). IIUC, normalization
requires being able to look ahead at some of the following Unicode characters.
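A small illustration of the look-ahead problem, using Python's `unicodedata` module rather than gnulib (so only an analogy to what the tool would do): normalizing each chunk independently misses a composition that spans the chunk boundary. The helper name `nfc_per_chunk` is hypothetical.

```python
import unicodedata


def nfc_per_chunk(chunks):
    """Normalize each chunk on its own and concatenate the results."""
    return "".join(unicodedata.normalize("NFC", c) for c in chunks)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9 in NFC.
whole = unicodedata.normalize("NFC", "e\u0301")

# If the boundary falls between the two code points, neither chunk
# changes on its own, so the composition never happens.
split = nfc_per_chunk(["e", "\u0301"])

print(whole == "\u00e9")  # True
print(split == whole)     # False
```

So a block-based normalizer would need to hold back the tail of each block until it knows no following character can still combine with it.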
-assaf