bug#18777: [PATCH] dfa: improvement for checking of multibyte character

From:	Paul Eggert
Subject:	bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date:	Mon, 15 Dec 2014 09:43:54 -0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

On 12/15/2014 06:59 AM, Norihiro Tanaka wrote:

+/* True if each byte can not occur inside a multibyte character  */
+static bool always_single_byte[NOTCHAR];
+
+static void
+dfaalwayssb (void)
+{
+  size_t i;
+  unsigned char const uc[] = { '\0', '\n', '\r', '.', '/' };
+  for (i = 0; i < sizeof uc / sizeof uc[0]; ++i)
+    always_single_byte[uc[i]] = true;
+}

Can't we improve this when using_utf8 () is true? In that case, every ASCII character is always single byte. Also, the bytes 0xc0, 0xc1, and 0xf5 through 0xff can be added to the table: they are not single-byte characters but they are always encoding errors so they will be a character boundary as far as skip_remains_mb is concerned. This suggests that the table 'always_single_byte' should be renamed to something like 'always_character_boundary'.
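The UTF-8 case described above could be sketched roughly as follows. This is only an illustration, not code from the patch or from dfa.c: the initializer name init_always_character_boundary_utf8 is hypothetical, NOTCHAR is assumed to be 256 (one entry per byte value), and the table name is the one suggested in the message.

```c
#include <stdbool.h>

/* Assumption for this sketch: one table entry per possible byte value.  */
enum { NOTCHAR = 256 };

/* True for bytes that always mark a character boundary in UTF-8:
   either a complete single-byte character, or a byte that is
   always an encoding error.  */
static bool always_character_boundary[NOTCHAR];

/* Hypothetical initializer covering only the UTF-8 case.  */
static void
init_always_character_boundary_utf8 (void)
{
  int b;

  /* Every ASCII byte is a complete single-byte character and never
     appears inside a multibyte sequence.  */
  for (b = 0x00; b <= 0x7f; b++)
    always_character_boundary[b] = true;

  /* 0xc0 and 0xc1 could only begin overlong two-byte sequences,
     so they are always encoding errors.  */
  always_character_boundary[0xc0] = true;
  always_character_boundary[0xc1] = true;

  /* 0xf5 through 0xff never occur in valid UTF-8.  */
  for (b = 0xf5; b <= 0xff; b++)
    always_character_boundary[b] = true;
}
```

Under these assumptions, lead bytes 0xc2 through 0xf4 and continuation bytes 0x80 through 0xbf stay false, so they would still go through the slower multibyte check.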

    wint_t wc = WEOF;
+  if (always_single_byte[*p])
+    return p;

This won't assign anything to *WCP, contrary to the documented API for skip_remains_mb. This is OK (as callers don't care) but the API documentation should be changed to reflect the actual behavior.


