grep-commit

grep branch, master, updated. v3.4-almost-38-g4a9ad19


From: Jim Meyering
Subject: grep branch, master, updated. v3.4-almost-38-g4a9ad19
Date: Tue, 22 Sep 2020 11:48:36 -0400 (EDT)

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".

The branch, master has been updated
       via  4a9ad19352a014938b54203f8ec4c7e1e48c104f (commit)
       via  ae65513edc80a1b65f19264b9bed95d870602967 (commit)
       via  34ada37baac651312b0e30092c086833a7223df6 (commit)
       via  0bc552476f8c707de70289de2f3ca06294cd96c6 (commit)
      from  1444b4979dc5935b7fe1d13e76539dddbaabd242 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=4a9ad19352a014938b54203f8ec4c7e1e48c104f


commit 4a9ad19352a014938b54203f8ec4c7e1e48c104f
Author: Jim Meyering <meyering@fb.com>
Date:   Mon Sep 21 13:06:15 2020 -0700

    tests: test for many-regexp N^2 RSS regression
    
    * tests/many-regex-performance: New test for this performance
    regression.
    * tests/Makefile.am: Add it.
    * NEWS (Bug fixes): Describe it.

diff --git a/NEWS b/NEWS
index 79b9db0..442d4d2 100644
--- a/NEWS
+++ b/NEWS
@@ -39,6 +39,14 @@ GNU grep NEWS                                    -*- outline -*-
   A performance regression with many duplicate patterns has been fixed.
   [Bug#43040 introduced in grep 3.4]
 
+  An N^2 RSS performance regression with many patterns has been fixed
+  in common cases (no backref, and no use of -o or --color).
+  With only 80,000 lines of /usr/share/dict/linux.words, the following
+  would use 100GB of RSS and take 3 minutes. With the fix, it used less
+  than 400MB and took less than one second:
+    head -80000 /usr/share/dict/linux.words > w; grep -vf w w
+  [Bug#43527 introduced in grep 3.4]
+
 ** Build-related
 
   "make dist" builds .tar.gz files again, as they are still used in
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 9d49e40..83e7087 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -120,6 +120,7 @@ TESTS =                                             \
   kwset-abuse                                  \
   long-line-vs-2GiB-read                       \
   long-pattern-perf                            \
+  many-regex-performance                       \
   match-lines                                  \
   max-count-overread                           \
   max-count-vs-context                         \
diff --git a/tests/many-regex-performance b/tests/many-regex-performance
new file mode 100755
index 0000000..ee68a04
--- /dev/null
+++ b/tests/many-regex-performance
@@ -0,0 +1,79 @@
+#!/bin/sh
+# Test for this performance regression:
+# grep-3.4 would require O(N^2) RSS for N regexps
+# grep-3.5 requires O(N) in the most common cases.
+
+# Copyright 2020 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+fail=0
+
+# This test is susceptible to failure due to differences in
+# system load during the two test runs, so we'll mark it as
+# "expensive", making it less likely to be run by regular users.
+expensive_
+
+# Make the quick/small input large enough so that even on high-end
+# systems this first invocation takes at least 10ms of user time.
+word_list=/usr/share/dict/linux.words
+
+# If $word_list does not exist, generate an input that exhibits
+# similar performance characteristics.
+if ! test -f $word_list; then
+  # Generate data comparable to that word list.
+  # Note how all "words" start with "a", and that there is
+  # a small percentage of lines with at least one "." metachar.
+  # This requires /dev/urandom, so if it's not present, skip
+  # this test. If desperate, we could fall back to using
+  # tar+compressed lib/*.c as the data source.
+  test -r /dev/urandom \
+    || skip_ 'this system has neither word list nor working /dev/urandom'
+  word_list=word_list
+  ( echo a; cat /dev/urandom           \
+    | LC_ALL=C tr -dc 'a-zA-Z0-9_'     \
+    | head -c500000                    \
+    | sed 's/\(........\)/\1\n/g'      \
+    | sed s/rs/./                      \
+    | sed s/./a/                       \
+    | sort                             \
+  ) > $word_list
+fi
+
+n_lines=2000
+while :; do
+  sed ${n_lines}q < $word_list > in || framework_failure_
+  small_ms=$(LC_ALL=C user_time_ 1 grep --file=in -v in) || fail=1
+  test $small_ms -ge 10 && break
+  n_lines=$(expr $n_lines + 2000)
+done
+
+# Now, run it again, but with 20 times as many lines.
+n_lines=$(expr $n_lines \* 20)
+sed ${n_lines}q < $word_list > in || framework_failure_
+large_ms=$(LC_ALL=C user_time_ 1 grep --file=in -v in) || fail=1
+
+# Deliberately recording in an unused variable so it
+# shows up in set -x output, in case this test fails.
+ratio=$(expr "$large_ms" / "$small_ms")
+
+# The duration of the larger run must be no more than 60 times
+# that of the small one.  Using recent versions prior to this fix,
+# this test would fail due to ratios larger than 300.  Using the
+# fixed version, it's common to see a ratio of 20-30.
+returns_ 1 expr $small_ms '<' $large_ms / 60 || fail=1
+
+Exit $fail
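
The test above relies on a portable-shell idiom worth calling out: `sed ${n_lines}q` prints the first N lines of its input and quits, serving as a POSIX-portable stand-in for `head -n N`. A minimal sketch of that idiom:

```shell
# 'sed Nq' auto-prints each line and quits at line N, so it emits
# exactly the first N lines -- equivalent to 'head -n N' but portable
# to shells/systems where head's -n handling has varied.
printf 'alpha\nbeta\ngamma\ndelta\n' | sed 2q
# prints:
# alpha
# beta
```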

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=ae65513edc80a1b65f19264b9bed95d870602967


http://git.savannah.gnu.org/cgit/grep.git/commit/?id=34ada37baac651312b0e30092c086833a7223df6


http://git.savannah.gnu.org/cgit/grep.git/commit/?id=0bc552476f8c707de70289de2f3ca06294cd96c6


-----------------------------------------------------------------------

Summary of changes:
 NEWS                         |  8 +++++
 gnulib                       |  2 +-
 src/dfasearch.c              | 30 +++++++++--------
 src/grep.c                   |  7 ++--
 src/kwsearch.c               |  6 ++--
 src/pcresearch.c             |  2 +-
 src/search.h                 |  6 ++--
 tests/Makefile.am            |  2 ++
 tests/many-regex-performance | 79 ++++++++++++++++++++++++++++++++++++++++++++
 tests/stack-overflow         |  5 +++
 10 files changed, 123 insertions(+), 24 deletions(-)
 create mode 100755 tests/many-regex-performance
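
The NEWS entry's repro runs `grep -vf w w` over 80,000 dictionary lines; the same shape can be exercised in miniature with a tiny hand-made word list (this sketches the behavior, not the timing):

```shell
# Scaled-down sketch of the NEWS repro 'grep -vf w w': each line of w
# is also one of the patterns, so every line matches, -v selects
# nothing, and grep exits with status 1 (no lines selected).
printf 'alpha\nbeta\ngamma\n' > w
grep -vf w w || echo "no lines survived (exit status $?)"
rm -f w
# prints:
# no lines survived (exit status 1)
```

The pre-fix N^2 RSS blowup only becomes visible at scales like the 80,000-line case quoted above.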


hooks/post-receive
-- 
grep


