[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: mbcel module for Gnulib?, incomplete multibyte sequences
From: |
Paul Eggert |
Subject: |
Re: mbcel module for Gnulib?, incomplete multibyte sequences |
Date: |
Fri, 21 Jul 2023 16:25:17 -0700 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0 |
On 2023-07-20 08:28, Bruno Haible wrote:
e.g., it worries about mbrtoc32 returning (size_t) -3,
This makes only for a small performance difference.
Yes, if done right, it matters only when input contains encoding errors.
or returning (size_t) -1 in the C locale.
Indeed, this shows as a difference between mbiterf and mbcel in the
test cases c, f:
mbiterf mbcel mbuiterf
c 1.145 0.670 1.179
f 13.028 5.714 14.654
But since the glibc people are already working on resolving this issue,
Although that'll help quite a bit, it won't be enough to overcome the
performance overhead. mbiter, mbiterf, mbuiter, and mbuiterf (which I'll
shorten to "mbu?iterf?") carry around more stuff, and it'll likely be
significant overhead even after the issue is resolved. For now, anyway,
the overhead is large enough that mbcel is a winner for diff.
Even mbcel strains GCC's capabilities to optimize. When I look at the
x86-64 code it generates I see clear opportunities for tighter code. If
I have time I'll fire off a suggestion or two to the GCC maintainers.
With luck these suggestions would also help mbu?iterf?.
The mbu?iterf? overhead would be OK if it bought something for diff. But
it doesn't. And I suspect we'll see something similar in other GNU code
that has held off on full support for multi-byte encodings due to
worries about complexity and performance.
The issue here is not just performance, but complexity. See the attached
patch, which I just installed into diffutils. It has a function
mbcel_strcasecmp which has the same behavior as Gnulib's mbuiterf-using
mbscasecmp, and with multi-byte source code that is significantly
simpler than the mbuiterf code. Here's the mbcel version:
while (true)
{
mbcel_t g1 = mbcel_scanz (p1); p1 += g1.len;
mbcel_t g2 = mbcel_scanz (p2); p2 += g2.len;
int cmp = mbcel_casecmp (g1, g2);
if (_GL_UNLIKELY (cmp | ! (g1.ch | g1.err)))
return cmp;
}
and the mbuiterf equivalent:
mbuif_state_t state1;
const char *iter1;
mbuif_init (state1);
iter1 = s1;
mbuif_state_t state2;
const char *iter2;
mbuif_init (state2);
iter2 = s2;
while (mbuif_avail (state1, iter1) && mbuif_avail (state2, iter2))
{
mbchar_t cur1 = mbuif_next (state1, iter1);
mbchar_t cur2 = mbuif_next (state2, iter2);
int cmp = mb_casecmp (cur1, cur2);
if (cmp != 0)
return cmp;
iter1 += mb_len (cur1);
iter2 += mb_len (cur2);
}
if (mbuif_avail (state1, iter1))
/* s2 terminated before s1. */
return 1;
if (mbuif_avail (state2, iter2))
/* s1 terminated before s2. */
return -1;
return 0;
On my platform the machine code for mbcel_strcasecmp (including its
single-byte code) is 676 bytes, whereas mbscasecmp's contains 1230 bytes
(1055 bytes if compiled with NDEBUG). And here is the list of functions
that the two implementations call, along with the static number of calls
(C for mbcel, I for mbuiterf, J for mbuiterf with NDEBUG):
C I J
3 __assert_fail
1 3 3 __ctype_get_mb_cur_max
1 1 1 __ctype_to_lower_loc
2 3 3 mbrtoc32 / rpl_mbrtoc32
4 2 mbsinit
3 3 memcmp
2 2 strlen
3 3 strnlen1
2 2 2 towlower
Although one could tune the mbuiterf version to make it simpler and
faster (and I'm mentioning the above functions to help that effort, if
you want to take the time), I don't see how it ever could match the
mcbel version, given its more-complex API.
Since the two versions have the same behavior and the mbcel version is
simpler and faster, I'd like to replace mbscasecmp's mbuiterf version
with the mbcel version in Gnulib. If that can't be done for some reason,
then let's have a switch to let the Gnulib user pick one implementation
or the other.
The other significant difference that I see is the handling of multibyte
sequences. When there 2 or 3 bytes (of, say, UTF-8) that constitute an
incomplete multibyte character at the end of the string,
This isn't a problem for programs like grep and diff, where there's
always a newline at the end the input buffer.
I disagree: Any program can run into it when the input is
<some valid UTF-8 characters><an incomplete UTF-8 character><newline>
The newline is part of the string - at least, that's how grep and diff
do things.
My screenshot from the 'src/diff -y -t' output in an xterm also shows
that there is an issue.
That's an issue of display; it's not an issue of how grep scans or
outputs the data. In code that simply scans, this issue does not matter.
For example, mbscasecmp behaves the same regardless of whether mbcel or
mbuiterf is used.
Sure. Some programs then treat that error as if an U+FFFD character
had been read.
- ISO 10646 says ([1] section R.7) "it shall interpret that malformed
sequence in the same way that it interprets a character that is outside
the adopted subset".
If I understand this requirement correctly mbcel satisfies it, as mbcel
treats those two things in the same way, namely, as sequences of
encoding error bytes.
No, I don't think mbcel satisfies it, since mbcel interprets the
"malformed sequence" not like "a character" but like multiple characters.
No, mbcel interprets it as a malformed byte sequence. It's not a
character, nor is it multiple characters. It is a sequence of encoding
error bytes, and it's up to the caller to deal with that sequence as it
chooses.
diff uses mbcel to output the malformed sequence as-is. That's how diff
signals encoding errors. This is common practice, not just for diff but
for many other applications. Of course this method has issues, just like
any method of signaling encoding errors does, but it's simple, it
doesn't lose information, it's a local sweet spot in how to deal with
encoding errors, and I don't see any serious suggestion to change how
diff behaves in this respect. Since mbcel supports this usage, diff
doesn't need the extra features provided by mbu?iterf?.
The Unicode Standard has several pages about this topic:
Unicode 15.0 section 3.9
https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf pages 124..129
That section is about converters. mbcel and mbiter are not converters:
they're scanners. You can use mbcel (or mbiter) to build a converter
that conforms to the standard in any way you like. You can also use
either of them in code that does not convert (e.g., mbscasecmp). So this
disagreement is not about mbcel itself, but how mbcel is used as part of
a larger application.
diff also is not a converter in the sense of the standard. It does not
convert UTF-8 text to a different encoding.
Also, the standard explicitly allows converters to treat each
encoding-error byte as a separate unit. It gives this as an example on
page 127, where it says "... in processing the UTF-8 code unit sequence
<F0 80 80 41>.... The converter could return ... <U+FFFD, U+FFFD,
U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error"
among other possible approaches.
(xterm does it right, but you are right that nowadays gnome-terminal and other
vte-based terminal emulators are the majority.)
xterm does not do it "right", at least, not on my platform (Ubuntu
23.04). It does not behave as Kuhn suggested in the last screenful of
Kuhn's example, as I described in a previous email.
If you were to use the 'mbiterf' module instead of mbcel, the
mb_equal macro from mbchar.h does the right thing. Yes, an mb_equal
call is a bit more complicated than the same_ch_err definition that
you have in diffutils/src/io.c. That's the unavoidable consequence of
treating a sequence of 2 or 3 bytes as *one* error.
OK, thanks for letting me know mbiterf can handle the situation.
What does mbiterf do in non-UTF-8 multi-byte locales? How can it tell
how long the invalid sequence is?
According to what I read in the Unicode Standard (above), it's a best practice
for all kinds of applications.
The Unicode standard doesn't say it's best practice. And even if it were
best practice for UTF-8 (which is dubious) I don't see see how it
generalizes to non-UTF-8 multi-byte encodings.
Admittedly these encodings are less important. I had even considered
dropping support for them in GNU diff, as that would simplify
maintenance and improve performance. However, that might be a bridge too
far for a few traditionalist users.
0001-diff-sort-multi-byte-file-names-better.patch
Description: Text Data
- Re: From wchar_t to char32_t, (continued)
- Re: From wchar_t to char32_t, Paul Eggert, 2023/07/06
- mbcel module for Gnulib?, Paul Eggert, 2023/07/09
- Re: mbcel module for Gnulib?, Bruno Haible, 2023/07/11
- Re: mbcel module for Gnulib?, Paul Eggert, 2023/07/12
- Re: mbcel module for Gnulib?, Bruno Haible, 2023/07/13
- Re: mbcel module for Gnulib?, Bruno Haible, 2023/07/16
- Re: mbcel module for Gnulib?, Bruno Haible, 2023/07/20
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/16
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/17
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/20
- Re: mbcel module for Gnulib?, incomplete multibyte sequences,
Paul Eggert <=
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/21
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/21
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/25
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/22
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Bruno Haible, 2023/07/24
- Re: mbcel module for Gnulib?, incomplete multibyte sequences, Paul Eggert, 2023/07/27