Re: multibyte processing - handling invalid sequences (long)
From: Assaf Gordon
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Tue, 26 Jul 2016 22:47:58 -0400
> On Jul 23, 2016, at 16:30, Pádraig Brady <address@hidden> wrote:
>
> On 23/07/16 19:05, Assaf Gordon wrote:
>>
>>> On Jul 23, 2016, at 06:51, Pádraig Brady <address@hidden> wrote:
>>> I was wondering about the tool being line/record oriented.
>>>
>>> Disadvantages are:
>>> requires arbitrary large buffers for arbitrary long lines
>>> relatively slow in the presence of short/normal lines
>>> sensitive to the current stdio buffering mode
>>> requires -z option to support NUL termination
>>>
>>> Processing instead a block at a time avoid such issues.
>>> UTF-8 at least is self synchronising, so after reading a block
>>> you just have to look at the last 3 bytes to know
>>> how many to append to the start of the next block.
Attached is a partial, crude implementation of stream-based processing.
It currently only handles fixing invalid sequences; Unicode normalization is
not implemented yet.
It contains both implementations, to ease comparison (use "-S/--stream" to
select the new implementation; without it, the previous line-based
implementation is used).
The main functions are (to facilitate discussion):
mbbuf_read - reads more data from the input, moves 'incomplete/left-over'
octets from previous read to the beginning of the buffer (somewhat like grep's
fillbuf() but not as sophisticated).
STRM_unorm_buf - iterates over the octets in the current buffer
STRM_unorm_fd - repeatedly reads the file and calls STRM_unorm_buf.
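
To make the carry-over step concrete, here is a minimal sketch of how the incomplete tail of a block could be detected and moved to the front of the buffer before the next read. It is not taken from the attached patch: the names utf8_incomplete_tail and carry_tail are hypothetical, and it assumes UTF-8 input only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return how many bytes at the end of buf[0..len) start a UTF-8
   sequence that is still waiting for continuation bytes.  Returns 0
   if the buffer ends on a sequence boundary or on bytes that can
   never become valid.  (Hypothetical helper, not from the patch.)  */
size_t
utf8_incomplete_tail (const unsigned char *buf, size_t len)
{
  assert (buf != NULL || len == 0);

  /* The longest sequence is 4 bytes, so a pending lead byte is at
     most 3 bytes from the end.  */
  size_t max_back = len < 3 ? len : 3;

  for (size_t back = 1; back <= max_back; back++)
    {
      unsigned char c = buf[len - back];
      size_t need;

      if ((c & 0xC0) == 0x80)
        continue;               /* continuation byte: keep looking back */

      if ((c & 0xE0) == 0xC0)
        need = 2;
      else if ((c & 0xF0) == 0xE0)
        need = 3;
      else if ((c & 0xF8) == 0xF0)
        need = 4;
      else
        return 0;               /* ASCII or invalid lead: nothing pending */

      return back < need ? back : 0;
    }
  return 0;                     /* only continuation bytes were seen */
}

/* Move the pending tail to the start of buf and return its length;
   the caller would then read new data at buf + the returned value.  */
size_t
carry_tail (unsigned char *buf, size_t len)
{
  size_t tail = utf8_incomplete_tail (buf, len);
  if (tail > 0)
    memmove (buf, buf + len - tail, tail);
  return tail;
}
```

A read loop would call carry_tail after processing each block and pass the remaining space (buffer size minus the returned length) to the next read.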
The tests use both methods and the results are identical (except for Unicode
normalization, which is currently skipped for --stream).
A few issues are emerging:
1. If only validation is required (i.e. no Unicode normalization), it is
wasteful to convert the input to wchar_t and back again. It would be better to
write the valid input octets out as-is. If Unicode normalization is requested,
then going through wchar_t and uninorm's filter is needed. Perhaps two separate
dedicated functions would be more efficient.
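
As an illustration of a validation-only path that never converts to wchar_t, the following sketch checks UTF-8 sequences directly and copies valid bytes through unchanged. It is an assumption-laden example, not the code in the patch: the names utf8_valid_len and copy_valid_utf8 are hypothetical, the input is assumed to be UTF-8, and replacing each invalid octet with '?' is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length (1..4) of the valid UTF-8 sequence starting at p,
   or 0 if the bytes there are invalid or truncated.  Overlong forms,
   surrogates and values above U+10FFFF are rejected.  */
size_t
utf8_valid_len (const unsigned char *p, size_t avail)
{
  if (avail == 0)
    return 0;

  unsigned char c = p[0];
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of 2nd byte */
  size_t n;

  if (c < 0x80)
    return 1;
  else if (c >= 0xC2 && c <= 0xDF)
    n = 2;
  else if (c >= 0xE0 && c <= 0xEF)
    {
      n = 3;
      if (c == 0xE0) lo = 0xA0;         /* reject overlong forms */
      if (c == 0xED) hi = 0x9F;         /* reject surrogates */
    }
  else if (c >= 0xF0 && c <= 0xF4)
    {
      n = 4;
      if (c == 0xF0) lo = 0x90;         /* reject overlong forms */
      if (c == 0xF4) hi = 0x8F;         /* reject > U+10FFFF */
    }
  else
    return 0;

  if (avail < n || p[1] < lo || p[1] > hi)
    return 0;
  for (size_t i = 2; i < n; i++)
    if ((p[i] & 0xC0) != 0x80)
      return 0;
  return n;
}

/* Copy in[0..len) to out, passing valid sequences through untouched
   and replacing each invalid octet with '?'.  out must have room for
   len bytes.  Returns the number of bytes written.  */
size_t
copy_valid_utf8 (const unsigned char *in, size_t len, unsigned char *out)
{
  assert (out != NULL);
  size_t o = 0;
  for (size_t i = 0; i < len; )
    {
      size_t n = utf8_valid_len (in + i, len - i);
      if (n == 0)
        {
          out[o++] = '?';
          i++;
        }
      else
        {
          memcpy (out + o, in + i, n);
          o += n;
          i += n;
        }
    }
  return o;
}
```

Note that this sketch treats a sequence truncated at the end of the buffer as invalid; in a streaming setting those trailing octets would instead be carried over to the next block.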
2. Regarding skipping stdio buffering: I assume you referred to dealing with
input. The code now uses file descriptors and 'safe_read', thus bypassing stdio
buffering on input. But it still uses stdio for output (this seems in line with
tac, split, tr, etc.). If we want to bypass stdio on output as well, some extra
code for internal buffering might be needed.
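
For a sense of how much extra code output buffering without stdio might take, here is a rough sketch built on POSIX write(). Everything in it is hypothetical (struct outbuf, outbuf_put, outbuf_flush and the 8192-byte size are illustrative choices, not part of the patch), and a real implementation in coreutils would more likely build on gnulib's full_write.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical fixed-size output buffer tied to a file descriptor.  */
struct outbuf
{
  int fd;
  size_t used;
  char data[8192];
};

/* Write all buffered bytes to ob->fd, retrying on short writes and
   EINTR.  Returns 0 on success, -1 on error (errno is set).  */
int
outbuf_flush (struct outbuf *ob)
{
  assert (ob != NULL);
  size_t off = 0;
  while (off < ob->used)
    {
      ssize_t w = write (ob->fd, ob->data + off, ob->used - off);
      if (w < 0)
        {
          if (errno == EINTR)
            continue;
          return -1;
        }
      off += (size_t) w;
    }
  ob->used = 0;
  return 0;
}

/* Append n bytes from src, flushing whenever the buffer fills up.
   Returns 0 on success, -1 on a write error.  */
int
outbuf_put (struct outbuf *ob, const void *src, size_t n)
{
  const char *p = src;
  while (n > 0)
    {
      size_t room = sizeof ob->data - ob->used;
      if (room == 0)
        {
          if (outbuf_flush (ob) != 0)
            return -1;
          room = sizeof ob->data;
        }
      size_t chunk = n < room ? n : room;
      memcpy (ob->data + ob->used, p, chunk);
      ob->used += chunk;
      p += chunk;
      n -= chunk;
    }
  return 0;
}
```

The caller must remember to call outbuf_flush once before exiting, which is the part stdio otherwise handles automatically.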
3. I believe that for this tool to be really useful, it should report the line
number and column of offending/invalid octets. In that case, the code needs to
count lines / columns, and will need to be aware of which line-terminator is
used - meaning "-z" is still needed.
The attached code does count lines/columns (see struct mbbuffer), and thus is a
bit cumbersome.
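
The bookkeeping involved can be sketched roughly as follows. This is only an illustration of why the line terminator matters: struct position and its functions are hypothetical and do not reflect the actual fields of struct mbbuffer in the patch. The sketch assumes the caller invokes position_advance once per character (or once per invalid octet), passing the first byte of that character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical position tracker; 'eol' is '\n' normally and '\0'
   when a "-z"-style option is in effect.  */
struct position
{
  size_t line;                  /* 1-based line number */
  size_t column;                /* 1-based column of the next character */
  unsigned char eol;            /* line terminator octet */
};

void
position_init (struct position *pos, int zero_terminated)
{
  assert (pos != NULL);
  pos->line = 1;
  pos->column = 1;
  pos->eol = zero_terminated ? '\0' : '\n';
}

/* Account for one consumed character whose first byte is 'first'.
   After the call, line/column describe where the next character
   will be.  */
void
position_advance (struct position *pos, unsigned char first)
{
  assert (pos != NULL);
  if (first == pos->eol)
    {
      pos->line++;
      pos->column = 1;
    }
  else
    pos->column++;
}
```

When an invalid octet is found, the values of line and column before the call to position_advance are the ones to report.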
Currently it seems this optimization leads to somewhat more complicated code.
Once I have the Unicode normalization implemented, we can compare speeds
and see which method is preferred.
Comments are very welcome,
- assaf
0001-unorm-a-new-program-to-fix-and-normalize-multibyte-f.patch.xz
Description: Binary data