grep-commit
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

grep branch, master, updated. v3.6-4-g192e599


From: Jim Meyering
Subject: grep branch, master, updated. v3.6-4-g192e599
Date: Fri, 27 Nov 2020 02:40:00 -0500 (EST)

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".

The branch, master has been updated
       via  192e59903c7d313bb47de3d5c15b3dc634e98c5f (commit)
       via  9095818f2669ceed97c41d72c5269f60cbfa9260 (commit)
       via  623008c296577495e00325bd3640a1b0051fa044 (commit)
      from  fac92da8bce6544582e76a720fe0787c0e0b860a (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=192e59903c7d313bb47de3d5c15b3dc634e98c5f


commit 192e59903c7d313bb47de3d5c15b3dc634e98c5f
Author: Jim Meyering <meyering@fb.com>
Date:   Wed Nov 25 16:49:51 2020 -0800

    grep: avoid performance regression with many patterns
    
    * src/grep.c (hash_pattern): Switch from PJW to DJB2, to avoid an
    O(N) to O(N^2) performance regression due to hash collisions with
    patterns from e.g., seq 500000|tr 0-9 A-J
    Reported by Frank Heckenbach in https://bugs.gnu.org/44754
    * NEWS (Bug fixes): Mention it.
    * tests/hash-collision-perf: New file.
    * tests/Makefile.am (TESTS): Add it.

diff --git a/NEWS b/NEWS
index a45ced1..508759a 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,13 @@ GNU grep NEWS                                    -*- outline 
-*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Bug fixes
+
+  Preprocessing N patterns would take at least O(N^2) time when too many
+  patterns hashed to too few buckets. This now takes seconds, not days:
+  : | grep -Ff <(seq 6400000 | tr 0-9 A-J)
+  [Bug#44754 introduced in grep 3.5]
+
 
 * Noteworthy changes in release 3.6 (2020-11-08) [stable]
 
diff --git a/src/grep.c b/src/grep.c
index eb7c60e..5a58bf5 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -128,8 +128,9 @@ hash_pattern (void const *pat, size_t n_buckets)
 {
   size_t h = 0;
   intptr_t pat_offset = (intptr_t) pat - 1;
-  for (char const *s = pattern_array + pat_offset; *s != '\n'; s++)
-    h = *s + ((h << 9) | (h >> (SIZE_WIDTH - 9)));
+  unsigned char const *s = (unsigned char const *) pattern_array + pat_offset;
+  for ( ; *s != '\n'; s++)
+    h = h * 33 ^ *s;
   return h % n_buckets;
 }
 static bool _GL_ATTRIBUTE_PURE
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 480bfb4..b1b5d61 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -109,6 +109,7 @@ TESTS =                                             \
   grep-dev-null                                        \
   grep-dev-null-out                            \
   grep-dir                                     \
+  hash-collision-perf                          \
   help-version                                 \
   high-bit-range                               \
   in-eq-out-infloop                            \
diff --git a/tests/hash-collision-perf b/tests/hash-collision-perf
new file mode 100755
index 0000000..38c876d
--- /dev/null
+++ b/tests/hash-collision-perf
@@ -0,0 +1,53 @@
+#!/bin/sh
+# Test for this performance regression:
+# grep-3.5 and 3.6 would take O(N^2) time for some sets of input regexps.
+
+# Copyright 2020 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+fail=0
+
+: > empty || framework_failure_
+
+# Construct a test case that consumes enough CPU time that we don't
+# have to worry about measurement noise. This first case is searching
+# for digits, which never exhibited a problem with hash collisions.
+n_pat=40000
+while :; do
+  seq $n_pat > in || framework_failure_
+  small_ms=$(LC_ALL=C user_time_ 1 grep --file=in empty) || fail=1
+  test $small_ms -ge 200 && break
+  n_pat=$(expr $n_pat '*' 2)
+done
+
+# Now, search for those same digits mapped to A-J.
+# With the PJW-based hash function, this became O(N^2).
+seq $n_pat | tr 0-9 A-J > in || framework_failure_
+large_ms=$(LC_ALL=C user_time_ 1 grep --file=in empty) || fail=1
+
+# Deliberately recording in an unused variable so it
+# shows up in set -x output, in case this test fails.
+ratio=$(expr "$large_ms" / "$small_ms")
+warn_ ratio=$ratio
+
+# The duration of the latter run must be no more than 10 times
+# that of the former.  Using recent versions prior to this fix,
+# this test would fail due to ratios > 800.  Using the fixed version,
+# it's common to see a ratio less than 1.
+returns_ 1 expr $small_ms '<' $large_ms / 10 || fail=1
+
+Exit $fail

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=9095818f2669ceed97c41d72c5269f60cbfa9260


commit 192e59903c7d313bb47de3d5c15b3dc634e98c5f
Author: Jim Meyering <meyering@fb.com>
Date:   Wed Nov 25 16:49:51 2020 -0800

    grep: avoid performance regression with many patterns
    
    * src/grep.c (hash_pattern): Switch from PJW to DJB2, to avoid an
    O(N) to O(N^2) performance regression due to hash collisions with
    patterns from e.g., seq 500000|tr 0-9 A-J
    Reported by Frank Heckenbach in https://bugs.gnu.org/44754
    * NEWS (Bug fixes): Mention it.
    * tests/hash-collision-perf: New file.
    * tests/Makefile.am (TESTS): Add it.

diff --git a/NEWS b/NEWS
index a45ced1..508759a 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,13 @@ GNU grep NEWS                                    -*- outline 
-*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Bug fixes
+
+  Preprocessing N patterns would take at least O(N^2) time when too many
+  patterns hashed to too few buckets. This now takes seconds, not days:
+  : | grep -Ff <(seq 6400000 | tr 0-9 A-J)
+  [Bug#44754 introduced in grep 3.5]
+
 
 * Noteworthy changes in release 3.6 (2020-11-08) [stable]
 
diff --git a/src/grep.c b/src/grep.c
index eb7c60e..5a58bf5 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -128,8 +128,9 @@ hash_pattern (void const *pat, size_t n_buckets)
 {
   size_t h = 0;
   intptr_t pat_offset = (intptr_t) pat - 1;
-  for (char const *s = pattern_array + pat_offset; *s != '\n'; s++)
-    h = *s + ((h << 9) | (h >> (SIZE_WIDTH - 9)));
+  unsigned char const *s = (unsigned char const *) pattern_array + pat_offset;
+  for ( ; *s != '\n'; s++)
+    h = h * 33 ^ *s;
   return h % n_buckets;
 }
 static bool _GL_ATTRIBUTE_PURE
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 480bfb4..b1b5d61 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -109,6 +109,7 @@ TESTS =                                             \
   grep-dev-null                                        \
   grep-dev-null-out                            \
   grep-dir                                     \
+  hash-collision-perf                          \
   help-version                                 \
   high-bit-range                               \
   in-eq-out-infloop                            \
diff --git a/tests/hash-collision-perf b/tests/hash-collision-perf
new file mode 100755
index 0000000..38c876d
--- /dev/null
+++ b/tests/hash-collision-perf
@@ -0,0 +1,53 @@
+#!/bin/sh
+# Test for this performance regression:
+# grep-3.5 and 3.6 would take O(N^2) time for some sets of input regexps.
+
+# Copyright 2020 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+fail=0
+
+: > empty || framework_failure_
+
+# Construct a test case that consumes enough CPU time that we don't
+# have to worry about measurement noise. This first case is searching
+# for digits, which never exhibited a problem with hash collisions.
+n_pat=40000
+while :; do
+  seq $n_pat > in || framework_failure_
+  small_ms=$(LC_ALL=C user_time_ 1 grep --file=in empty) || fail=1
+  test $small_ms -ge 200 && break
+  n_pat=$(expr $n_pat '*' 2)
+done
+
+# Now, search for those same digits mapped to A-J.
+# With the PJW-based hash function, this became O(N^2).
+seq $n_pat | tr 0-9 A-J > in || framework_failure_
+large_ms=$(LC_ALL=C user_time_ 1 grep --file=in empty) || fail=1
+
+# Deliberately recording in an unused variable so it
+# shows up in set -x output, in case this test fails.
+ratio=$(expr "$large_ms" / "$small_ms")
+warn_ ratio=$ratio
+
+# The duration of the latter run must be no more than 10 times
+# that of the former.  Using recent versions prior to this fix,
+# this test would fail due to ratios > 800.  Using the fixed version,
+# it's common to see a ratio less than 1.
+returns_ 1 expr $small_ms '<' $large_ms / 10 || fail=1
+
+Exit $fail

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=623008c296577495e00325bd3640a1b0051fa044


commit 192e59903c7d313bb47de3d5c15b3dc634e98c5f
Author: Jim Meyering <meyering@fb.com>
Date:   Wed Nov 25 16:49:51 2020 -0800

    grep: avoid performance regression with many patterns
    
    * src/grep.c (hash_pattern): Switch from PJW to DJB2, to avoid an
    O(N) to O(N^2) performance regression due to hash collisions with
    patterns from e.g., seq 500000|tr 0-9 A-J
    Reported by Frank Heckenbach in https://bugs.gnu.org/44754
    * NEWS (Bug fixes): Mention it.
    * tests/hash-collision-perf: New file.
    * tests/Makefile.am (TESTS): Add it.

diff --git a/NEWS b/NEWS
index a45ced1..508759a 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,13 @@ GNU grep NEWS                                    -*- outline 
-*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Bug fixes
+
+  Preprocessing N patterns would take at least O(N^2) time when too many
+  patterns hashed to too few buckets. This now takes seconds, not days:
+  : | grep -Ff <(seq 6400000 | tr 0-9 A-J)
+  [Bug#44754 introduced in grep 3.5]
+
 
 * Noteworthy changes in release 3.6 (2020-11-08) [stable]
 
diff --git a/src/grep.c b/src/grep.c
index eb7c60e..5a58bf5 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -128,8 +128,9 @@ hash_pattern (void const *pat, size_t n_buckets)
 {
   size_t h = 0;
   intptr_t pat_offset = (intptr_t) pat - 1;
-  for (char const *s = pattern_array + pat_offset; *s != '\n'; s++)
-    h = *s + ((h << 9) | (h >> (SIZE_WIDTH - 9)));
+  unsigned char const *s = (unsigned char const *) pattern_array + pat_offset;
+  for ( ; *s != '\n'; s++)
+    h = h * 33 ^ *s;
   return h % n_buckets;
 }
 static bool _GL_ATTRIBUTE_PURE
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 480bfb4..b1b5d61 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -109,6 +109,7 @@ TESTS =                                             \
   grep-dev-null                                        \
   grep-dev-null-out                            \
   grep-dir                                     \
+  hash-collision-perf                          \
   help-version                                 \
   high-bit-range                               \
   in-eq-out-infloop                            \
diff --git a/tests/hash-collision-perf b/tests/hash-collision-perf
new file mode 100755
index 0000000..38c876d
--- /dev/null
+++ b/tests/hash-collision-perf
@@ -0,0 +1,53 @@
+#!/bin/sh
+# Test for this performance regression:
+# grep-3.5 and 3.6 would take O(N^2) time for some sets of input regexps.
+
+# Copyright 2020 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+fail=0
+
+: > empty || framework_failure_
+
+# Construct a test case that consumes enough CPU time that we don't
+# have to worry about measurement noise. This first case is searching
+# for digits, which never exhibited a problem with hash collisions.
+n_pat=40000
+while :; do
+  seq $n_pat > in || framework_failure_
+  small_ms=$(LC_ALL=C user_time_ 1 grep --file=in empty) || fail=1
+  test $small_ms -ge 200 && break
+  n_pat=$(expr $n_pat '*' 2)
+done
+
+# Now, search for those same digits mapped to A-J.
+# With the PJW-based hash function, this became O(N^2).
+seq $n_pat | tr 0-9 A-J > in || framework_failure_
+large_ms=$(LC_ALL=C user_time_ 1 grep --file=in empty) || fail=1
+
+# Deliberately recording in an unused variable so it
+# shows up in set -x output, in case this test fails.
+ratio=$(expr "$large_ms" / "$small_ms")
+warn_ ratio=$ratio
+
+# The duration of the latter run must be no more than 10 times
+# that of the former.  Using recent versions prior to this fix,
+# this test would fail due to ratios > 800.  Using the fixed version,
+# it's common to see a ratio less than 1.
+returns_ 1 expr $small_ms '<' $large_ms / 10 || fail=1
+
+Exit $fail

-----------------------------------------------------------------------

Summary of changes:
 NEWS                      |  7 +++++++
 configure.ac              | 14 +++++++++++++
 gnulib                    |  2 +-
 gnulib-tests/Makefile.am  |  3 +++
 src/grep.c                |  7 ++++---
 tests/Makefile.am         |  1 +
 tests/hash-collision-perf | 53 +++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 83 insertions(+), 4 deletions(-)
 create mode 100755 tests/hash-collision-perf


hooks/post-receive
-- 
grep



reply via email to

[Prev in Thread] Current Thread [Next in Thread]