bug-grep

Re: [PATCH v2] dfa: optimize UTF-8 period


From: Jim Meyering
Subject: Re: [PATCH v2] dfa: optimize UTF-8 period
Date: Tue, 20 Apr 2010 12:06:05 +0200

Paolo Bonzini wrote:
> On 04/20/2010 12:47 AM, Eric Blake wrote:
>> On 04/19/2010 06:14 AM, Paolo Bonzini wrote:
>>> +  /* A valid UTF-8 character is
>>> +
>>> +          ([0x00-0x7f]
>>> +           |[0xc2-0xdf][0x80-0xbf]
>>> +           |[0xe0-0xef][0x80-0xbf][0x80-0xbf]
>>> +           |[0xf0-0xf7][0x80-0xbf][0x80-0xbf][0x80-0xbf])
>>
>> Yes, but in POSIX XBD 9.3.4,
>> http://www.opengroup.org/onlinepubs/9699919799/toc.htm, the ANYCHAR does
>> not match NUL.  Do you need to adjust this patch to exclude 0x00?

Good catch, Eric.
Note that "." in GNU grep appears to have always matched NUL,
whereas in Solaris 10's grep it does not.

> Yes (following the syntax bits).
>
> Does this seem okay?
>
> Paolo
>
> diff --git a/gnulib b/gnulib
> index 5fbd6e3..bfffe40 160000
> --- a/gnulib
> +++ b/gnulib
> @@ -1 +1 @@
> -Subproject commit 5fbd6e3e571c6e59270fa486bd7c83dfe04c87cf
> +Subproject commit bfffe408f8b375fd0989266bd8c01580be26d1a8
> diff --git a/src/dfa.c b/src/dfa.c
> index 61322d1..d9c5ba2 100644
> --- a/src/dfa.c
> +++ b/src/dfa.c
> @@ -1487,7 +1487,17 @@ add_utf8_anychar (void)
>    /* Define the five character classes that are needed below.  */
>    if (dfa->utf8_anychar_classes[0] == 0)
>      for (i = 0; i < n; i++)
> -      dfa->utf8_anychar_classes[i] = CSET + charclass_index(utf8_classes[i]);
> +      {
> +        charclass c = utf8_classes[i];
> +        if (i == 1)
> +          {
> +            if (!(syntax_bits & RE_DOT_NEWLINE))
> +              clrbit (c, eolbyte);
> +            if (syntax_bits & RE_DOT_NOT_NULL)
> +              clrbit (c, '\0');
> +          }
> +        dfa->utf8_anychar_classes[i] = CSET + charclass_index(c);
> +      }

Please put braces around the now-longer "if" block.
I presume you didn't intend that gnulib change.
The above didn't compile for me, so I tried this instead:

diff --git a/src/dfa.c b/src/dfa.c
index 7d39e5c..340a4c6 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1487,7 +1487,18 @@ add_utf8_anychar (void)
   /* Define the five character classes that are needed below.  */
   if (dfa->utf8_anychar_classes[0] == 0)
     for (i = 0; i < n; i++)
-      dfa->utf8_anychar_classes[i] = CSET + charclass_index(utf8_classes[i]);
+      {
+        charclass c;
+        memcpy (c, utf8_classes[i], sizeof c);
+        if (i == 1)
+          {
+            if (!(syntax_bits & RE_DOT_NEWLINE))
+              clrbit (eolbyte, c);
+            if (syntax_bits & RE_DOT_NOT_NULL)
+              clrbit ('\0', c);
+          }
+        dfa->utf8_anychar_classes[i] = CSET + charclass_index(c);
+      }

   /* A valid UTF-8 character is

But even with that, each of these still matches:

  printf '\n'|LC_ALL=en_US.utf8 src/grep -zl .
  printf '\0'|LC_ALL=en_US.utf8 src/grep -l .

Both should fail to match (exit status 1, no output).

Here's a test addition, for once this is fixed:

From 80c0babf19e5c72322bd3c86b80985121b430c30 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Tue, 20 Apr 2010 11:34:57 +0200
Subject: [PATCH] tests: ensure "." matches neither newline nor NUL

* tests/dot-vs-NUL-and-NL: New file.
* tests/Makefile.am (TESTS): Add it.
---
 tests/Makefile.am       |    1 +
 tests/dot-vs-NUL-and-NL |   26 ++++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 0 deletions(-)
 create mode 100644 tests/dot-vs-NUL-and-NL

diff --git a/tests/Makefile.am b/tests/Makefile.am
index fae2c85..86a35c1 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -37,6 +37,7 @@ TESTS =                                               \
   case-fold-char-type                          \
   char-class-multibyte                         \
   dfaexec-multibyte                            \
+  dot-vs-NUL-and-NL                            \
   empty                                                \
   ere.sh                                       \
   euc-mb                                       \
diff --git a/tests/dot-vs-NUL-and-NL b/tests/dot-vs-NUL-and-NL
new file mode 100644
index 0000000..c737e33
--- /dev/null
+++ b/tests/dot-vs-NUL-and-NL
@@ -0,0 +1,26 @@
+#!/bin/sh
+# Ensure that the match-any "." pattern does not match "\0" or "\n".
+: ${srcdir=.}
+. "$srcdir/init.sh"; path_prepend_ ../src
+
+require_en_utf8_locale_
+
+printf '\n' > nl || framework_failure_
+printf '\0' > nul || framework_failure_
+fail=0
+
+for loc in en_US.UTF-8 C; do
+
+  LC_ALL=$loc grep -zl . nl > out 2>&1
+  # Expect no match and no output.
+  test $? = 1 || fail=1
+  compare out /dev/null || fail=1
+
+  LC_ALL=$loc grep -l . nul > out 2>&1
+  # Expect no match and no output.
+  test $? = 1 || fail=1
+  compare out /dev/null || fail=1
+
+done
+
+Exit $fail
--
1.7.1.rc1.248.gcefbb



