Re: multibyte processing - handling invalid sequences (long)
From: Assaf Gordon
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Sun, 7 Aug 2016 01:34:11 -0400
Hello,
Attached is an improved version of 'unorm', with unicode-normalization support
for both the line-by-line and the buffer/stream methods.
The code is not yet cleaned up, but it allows comparing the performance of the
two approaches.
Briefly, it seems the buffer/stream method is more or less as fast as
line-by-line when *not* doing unicode normalization
(only one 'mbrtowc' call is done for each input character in both cases).
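To illustrate what "one 'mbrtowc' call per input character" means, here is a
minimal sketch of such a decode loop. It is not the code from the attached
patch: the helper name count_chars and the policy of consuming an invalid or
truncated sequence one byte at a time are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from the patch): counts the characters in
   buf[0..len), calling mbrtowc once per character in the current locale.
   Invalid or truncated sequences are counted one byte at a time. */
size_t count_chars(const char *buf, size_t len)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf, len, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid (-1) or incomplete (-2) sequence: the conversion
               state is no longer usable, so reset it and skip one byte. */
            memset(&st, 0, sizeof st);
            n = 1;
        } else if (n == 0) {
            n = 1;                      /* embedded NUL byte */
        }
        buf += n;
        len -= n;
        count++;
    }
    return count;
}
```

A real filter would write out 'wc' (or the raw byte, for an invalid sequence)
where this sketch only increments the counter.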
When unicode normalization is used, the line-by-line method is faster, likely
because:
a) it calls u8_normalize on the entire buffer instead of streaming through
uninorm_filter,
b) the uninorm_filter streaming path requires an additional wctomb call for
each output character.
It's likely the buffer/streaming implementation could be improved.
However, there's an issue with uninorm_filter:
The functions (e.g. uninorm_filter_write in gnulib's uninorm.h) use 'ucs4_t',
but mbrtowc/wctomb use 'wchar_t'.
Is there a guarantee that a wchar_t value is actually a unicode code-point? (I
couldn't find one.)
Currently the code assumes that they are one and the same.
If that assumption is incorrect, an additional conversion will be needed.
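One possible check, offered as a sketch rather than a settled answer: C99
specifies the macro __STDC_ISO_10646__, which an implementation defines when
wchar_t values are the ISO 10646 (Unicode) code points of the characters they
represent. Where the macro is not defined, no such promise is made. The helper
name wchar_is_unicode below is hypothetical and not part of gnulib or the
patch.

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
   wchar_t values are ISO 10646 code points, 0 if it makes no such promise. */
int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;   /* a wchar_t from mbrtowc can be used as a code point */
#else
    return 0;   /* an explicit conversion step would be needed */
#endif
}
```

Even where the macro is defined, wchar_t must also be wide enough to hold every
code point for a plain cast to ucs4_t to be lossless.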
The following commands can be used to compare implementations ('-S' uses the
buffer/stream method instead of line-by-line):
Short lines, with and without normalization:
yes a | head -n 10M > data1
env time ./src/unorm < data1 > /dev/null
env time ./src/unorm -S < data1 > /dev/null
env time ./src/unorm -nfkc < data1 > /dev/null
env time ./src/unorm -nfkc -S < data1 > /dev/null
Long lines, with and without normalization:
yes | perl -npe '$_ = "x" x int(rand(10000)) . "\n"' | head -n 50K > data2
env time ./src/unorm < data2 > /dev/null
env time ./src/unorm -S < data2 > /dev/null
env time ./src/unorm -nfkc < data2 > /dev/null
env time ./src/unorm -S -nfkc < data2 > /dev/null
Comments welcome,
- assaf
unorm-2016-08-07.patch.xz
Description: Binary data