Re: mbcel module for Gnulib?, incomplete multibyte sequences

bug-gnulib

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: mbcel module for Gnulib?, incomplete multibyte sequences

From:	Paul Eggert
Subject:	Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date:	Fri, 21 Jul 2023 16:25:17 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0

On 2023-07-20 08:28, Bruno Haible wrote:

e.g., it worries about mbrtoc32 returning (size_t) -3,


This makes only for a small performance difference.


Yes, if done right, it matters only when input contains encoding errors.

or returning (size_t) -1 in the C locale.


Indeed, this shows as a difference between mbiterf and mbcel in the
test cases c, f:

     mbiterf mbcel   mbuiterf
c    1.145  0.670    1.179
f   13.028  5.714   14.654

But since the glibc people are already working on resolving this issue,

Although that'll help quite a bit, it won't be enough to overcome theperformance overhead. mbiter, mbiterf, mbuiter, and mbuiterf (which I'llshorten to "mbu?iterf?") carry around more stuff, and it'll likely besignificant overhead even after the issue is resolved. For now, anyway,the overhead is large enough that mbcel is a winner for diff.

Even mbcel strains GCC's capabilities to optimize. When I look at thex86-64 code it generates I see clear opportunities for tighter code. IfI have time I'll fire off a suggestion or two to the GCC maintainers.With luck these suggestions would also help mbu?iterf?.

The mbu?iterf? overhead would be OK if it bought something for diff. Butit doesn't. And I suspect we'll see something similar in other GNU codethat has held off on full support for multi-byte encodings due toworries about complexity and performance.

The issue here is not just performance, but complexity. See the attachedpatch, which I just installed into diffutils. It has a functionmbcel_strcasecmp which has the same behavior as Gnulib's mbuiterf-usingmbscasecmp, and with multi-byte source code that is significantlysimpler than the mbuiterf code. Here's the mbcel version:


    while (true)
      {
        mbcel_t g1 = mbcel_scanz (p1); p1 += g1.len;
        mbcel_t g2 = mbcel_scanz (p2); p2 += g2.len;
        int cmp = mbcel_casecmp (g1, g2);
        if (_GL_UNLIKELY (cmp | ! (g1.ch | g1.err)))
          return cmp;
      }

and the mbuiterf equivalent:

      mbuif_state_t state1;
      const char *iter1;
      mbuif_init (state1);
      iter1 = s1;

      mbuif_state_t state2;
      const char *iter2;
      mbuif_init (state2);
      iter2 = s2;

      while (mbuif_avail (state1, iter1) && mbuif_avail (state2, iter2))
        {
          mbchar_t cur1 = mbuif_next (state1, iter1);
          mbchar_t cur2 = mbuif_next (state2, iter2);
          int cmp = mb_casecmp (cur1, cur2);

          if (cmp != 0)
            return cmp;

          iter1 += mb_len (cur1);
          iter2 += mb_len (cur2);
        }
      if (mbuif_avail (state1, iter1))
        /* s2 terminated before s1.  */
        return 1;
      if (mbuif_avail (state2, iter2))
        /* s1 terminated before s2.  */
        return -1;
      return 0;

On my platform the machine code for mbcel_strcasecmp (including itssingle-byte code) is 676 bytes, whereas mbscasecmp's contains 1230 bytes(1055 bytes if compiled with NDEBUG). And here is the list of functionsthat the two implementations call, along with the static number of calls(C for mbcel, I for mbuiterf, J for mbuiterf with NDEBUG):


   C I J
     3   __assert_fail
   1 3 3 __ctype_get_mb_cur_max
   1 1 1 __ctype_to_lower_loc
   2 3 3 mbrtoc32 / rpl_mbrtoc32
     4 2 mbsinit
     3 3 memcmp
     2 2 strlen
     3 3 strnlen1
   2 2 2 towlower

Although one could tune the mbuiterf version to make it simpler andfaster (and I'm mentioning the above functions to help that effort, ifyou want to take the time), I don't see how it ever could match themcbel version, given its more-complex API.

Since the two versions have the same behavior and the mbcel version issimpler and faster, I'd like to replace mbscasecmp's mbuiterf versionwith the mbcel version in Gnulib. If that can't be done for some reason,then let's have a switch to let the Gnulib user pick one implementationor the other.

The other significant difference that I see is the handling of multibyte
sequences. When there 2 or 3 bytes (of, say, UTF-8) that constitute an
incomplete multibyte character at the end of the string,


This isn't a problem for programs like grep and diff, where there's
always a newline at the end the input buffer.


I disagree: Any program can run into it when the input is
   <some valid UTF-8 characters><an incomplete UTF-8 character><newline>

The newline is part of the string - at least, that's how grep and diffdo things.

My screenshot from the 'src/diff -y -t' output in an xterm also shows
that there is an issue.

That's an issue of display; it's not an issue of how grep scans oroutputs the data. In code that simply scans, this issue does not matter.For example, mbscasecmp behaves the same regardless of whether mbcel ormbuiterf is used.

Sure. Some programs then treat that error as if an U+FFFD character
had been read.

    - ISO 10646 says ([1] section R.7) "it shall interpret that malformed
      sequence in the same way that it interprets a character that is outside
      the adopted subset".


If I understand this requirement correctly mbcel satisfies it, as mbcel
treats those two things in the same way, namely, as sequences of
encoding error bytes.


No, I don't think mbcel satisfies it, since mbcel interprets the
"malformed sequence" not like "a character" but like multiple characters.

No, mbcel interprets it as a malformed byte sequence. It's not acharacter, nor is it multiple characters. It is a sequence of encodingerror bytes, and it's up to the caller to deal with that sequence as itchooses.

diff uses mbcel to output the malformed sequence as-is. That's how diffsignals encoding errors. This is common practice, not just for diff butfor many other applications. Of course this method has issues, just likeany method of signaling encoding errors does, but it's simple, itdoesn't lose information, it's a local sweet spot in how to deal withencoding errors, and I don't see any serious suggestion to change howdiff behaves in this respect. Since mbcel supports this usage, diffdoesn't need the extra features provided by mbu?iterf?.

The Unicode Standard has several pages about this topic:
Unicode 15.0 section 3.9
   https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf  pages 124..129

That section is about converters. mbcel and mbiter are not converters:they're scanners. You can use mbcel (or mbiter) to build a converterthat conforms to the standard in any way you like. You can also useeither of them in code that does not convert (e.g., mbscasecmp). So thisdisagreement is not about mbcel itself, but how mbcel is used as part ofa larger application.

diff also is not a converter in the sense of the standard. It does notconvert UTF-8 text to a different encoding.

Also, the standard explicitly allows converters to treat eachencoding-error byte as a separate unit. It gives this as an example onpage 127, where it says "... in processing the UTF-8 code unit sequence<F0 80 80 41>.... The converter could return ... <U+FFFD, U+FFFD,U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error"among other possible approaches.

(xterm does it right, but you are right that nowadays gnome-terminal and other
vte-based terminal emulators are the majority.)

xterm does not do it "right", at least, not on my platform (Ubuntu23.04). It does not behave as Kuhn suggested in the last screenful ofKuhn's example, as I described in a previous email.

If you were to use the 'mbiterf' module instead of mbcel, the
mb_equal macro from mbchar.h does the right thing. Yes, an mb_equal
call is a bit more complicated than the same_ch_err definition that
you have in diffutils/src/io.c. That's the unavoidable consequence of
treating a sequence of 2 or 3 bytes as *one* error.


OK, thanks for letting me know mbiterf can handle the situation.

What does mbiterf do in non-UTF-8 multi-byte locales? How can it tellhow long the invalid sequence is?

According to what I read in the Unicode Standard (above), it's a best practice
for all kinds of applications.

The Unicode standard doesn't say it's best practice. And even if it werebest practice for UTF-8 (which is dubious) I don't see see how itgeneralizes to non-UTF-8 multi-byte encodings.

Admittedly these encodings are less important. I had even considereddropping support for them in GNU diff, as that would simplifymaintenance and improve performance. However, that might be a bridge toofar for a few traditionalist users.

0001-diff-sort-multi-byte-file-names-better.patch
Description: Text Data

[Prev in Thread]

Current Thread

[Next in Thread]

Re: From wchar_t to char32_t, (continued)

Prev by Date: chown and chgrp won't change on systems without sys_lchown()
Next by Date: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Previous by thread: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Next by thread: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Index(es):
- Date
- Thread