bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Paul Eggert
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Fri, 21 Jul 2023 16:25:17 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 2023-07-20 08:28, Bruno Haible wrote:

e.g., it worries about mbrtoc32 returning (size_t) -3,

This makes only for a small performance difference.

Yes, if done right, it matters only when input contains encoding errors.


or returning (size_t) -1 in the C locale.

Indeed, this shows as a difference between mbiterf and mbcel in the
test cases c, f:

     mbiterf mbcel   mbuiterf
c    1.145  0.670    1.179
f   13.028  5.714   14.654

But since the glibc people are already working on resolving this issue,

Although that'll help quite a bit, it won't be enough to overcome the performance overhead. mbiter, mbiterf, mbuiter, and mbuiterf (which I'll shorten to "mbu?iterf?") carry around more stuff, and it'll likely be significant overhead even after the issue is resolved. For now, anyway, the overhead is large enough that mbcel is a winner for diff.

Even mbcel strains GCC's capabilities to optimize. When I look at the x86-64 code it generates I see clear opportunities for tighter code. If I have time I'll fire off a suggestion or two to the GCC maintainers. With luck these suggestions would also help mbu?iterf?.

The mbu?iterf? overhead would be OK if it bought something for diff. But it doesn't. And I suspect we'll see something similar in other GNU code that has held off on full support for multi-byte encodings due to worries about complexity and performance.

The issue here is not just performance, but complexity. See the attached patch, which I just installed into diffutils. It has a function mbcel_strcasecmp which has the same behavior as Gnulib's mbuiterf-using mbscasecmp, and with multi-byte source code that is significantly simpler than the mbuiterf code. Here's the mbcel version:

    while (true)
      {
        mbcel_t g1 = mbcel_scanz (p1); p1 += g1.len;
        mbcel_t g2 = mbcel_scanz (p2); p2 += g2.len;
        int cmp = mbcel_casecmp (g1, g2);
        if (_GL_UNLIKELY (cmp | ! (g1.ch | g1.err)))
          return cmp;
      }

and the mbuiterf equivalent:

      mbuif_state_t state1;
      const char *iter1;
      mbuif_init (state1);
      iter1 = s1;

      mbuif_state_t state2;
      const char *iter2;
      mbuif_init (state2);
      iter2 = s2;

      while (mbuif_avail (state1, iter1) && mbuif_avail (state2, iter2))
        {
          mbchar_t cur1 = mbuif_next (state1, iter1);
          mbchar_t cur2 = mbuif_next (state2, iter2);
          int cmp = mb_casecmp (cur1, cur2);

          if (cmp != 0)
            return cmp;

          iter1 += mb_len (cur1);
          iter2 += mb_len (cur2);
        }
      if (mbuif_avail (state1, iter1))
        /* s2 terminated before s1.  */
        return 1;
      if (mbuif_avail (state2, iter2))
        /* s1 terminated before s2.  */
        return -1;
      return 0;

On my platform the machine code for mbcel_strcasecmp (including its single-byte code) is 676 bytes, whereas mbscasecmp's contains 1230 bytes (1055 bytes if compiled with NDEBUG). And here is the list of functions that the two implementations call, along with the static number of calls (C for mbcel, I for mbuiterf, J for mbuiterf with NDEBUG):

   C I J
     3   __assert_fail
   1 3 3 __ctype_get_mb_cur_max
   1 1 1 __ctype_to_lower_loc
   2 3 3 mbrtoc32 / rpl_mbrtoc32
     4 2 mbsinit
     3 3 memcmp
     2 2 strlen
     3 3 strnlen1
   2 2 2 towlower

Although one could tune the mbuiterf version to make it simpler and faster (and I'm mentioning the above functions to help that effort, if you want to take the time), I don't see how it ever could match the mcbel version, given its more-complex API.

Since the two versions have the same behavior and the mbcel version is simpler and faster, I'd like to replace mbscasecmp's mbuiterf version with the mbcel version in Gnulib. If that can't be done for some reason, then let's have a switch to let the Gnulib user pick one implementation or the other.


The other significant difference that I see is the handling of multibyte
sequences. When there 2 or 3 bytes (of, say, UTF-8) that constitute an
incomplete multibyte character at the end of the string,

This isn't a problem for programs like grep and diff, where there's
always a newline at the end the input buffer.

I disagree: Any program can run into it when the input is
   <some valid UTF-8 characters><an incomplete UTF-8 character><newline>

The newline is part of the string - at least, that's how grep and diff do things.


My screenshot from the 'src/diff -y -t' output in an xterm also shows
that there is an issue.

That's an issue of display; it's not an issue of how grep scans or outputs the data. In code that simply scans, this issue does not matter. For example, mbscasecmp behaves the same regardless of whether mbcel or mbuiterf is used.


Sure. Some programs then treat that error as if an U+FFFD character
had been read.

    - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
      sequence in the same way that it interprets a character that is outside
      the adopted subset".

If I understand this requirement correctly mbcel satisfies it, as mbcel
treats those two things in the same way, namely, as sequences of
encoding error bytes.

No, I don't think mbcel satisfies it, since mbcel interprets the
"malformed sequence" not like "a character" but like multiple characters.

No, mbcel interprets it as a malformed byte sequence. It's not a character, nor is it multiple characters. It is a sequence of encoding error bytes, and it's up to the caller to deal with that sequence as it chooses.

diff uses mbcel to output the malformed sequence as-is. That's how diff signals encoding errors. This is common practice, not just for diff but for many other applications. Of course this method has issues, just like any method of signaling encoding errors does, but it's simple, it doesn't lose information, it's a local sweet spot in how to deal with encoding errors, and I don't see any serious suggestion to change how diff behaves in this respect. Since mbcel supports this usage, diff doesn't need the extra features provided by mbu?iterf?.


The Unicode Standard has several pages about this topic:
Unicode 15.0 section 3.9
   https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf  pages 124..129

That section is about converters. mbcel and mbiter are not converters: they're scanners. You can use mbcel (or mbiter) to build a converter that conforms to the standard in any way you like. You can also use either of them in code that does not convert (e.g., mbscasecmp). So this disagreement is not about mbcel itself, but how mbcel is used as part of a larger application.

diff also is not a converter in the sense of the standard. It does not convert UTF-8 text to a different encoding.

Also, the standard explicitly allows converters to treat each encoding-error byte as a separate unit. It gives this as an example on page 127, where it says "... in processing the UTF-8 code unit sequence <F0 80 80 41>.... The converter could return ... <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error" among other possible approaches.


(xterm does it right, but you are right that nowadays gnome-terminal and other
vte-based terminal emulators are the majority.)

xterm does not do it "right", at least, not on my platform (Ubuntu 23.04). It does not behave as Kuhn suggested in the last screenful of Kuhn's example, as I described in a previous email.


If you were to use the 'mbiterf' module instead of mbcel, the
mb_equal macro from mbchar.h does the right thing. Yes, an mb_equal
call is a bit more complicated than the same_ch_err definition that
you have in diffutils/src/io.c. That's the unavoidable consequence of
treating a sequence of 2 or 3 bytes as *one* error.

OK, thanks for letting me know mbiterf can handle the situation.

What does mbiterf do in non-UTF-8 multi-byte locales? How can it tell how long the invalid sequence is?


According to what I read in the Unicode Standard (above), it's a best practice
for all kinds of applications.

The Unicode standard doesn't say it's best practice. And even if it were best practice for UTF-8 (which is dubious) I don't see see how it generalizes to non-UTF-8 multi-byte encodings.

Admittedly these encodings are less important. I had even considered dropping support for them in GNU diff, as that would simplify maintenance and improve performance. However, that might be a bridge too far for a few traditionalist users.

Attachment: 0001-diff-sort-multi-byte-file-names-better.patch
Description: Text Data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]