grep-commit
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

grep branch, master, updated. v3.0-19-gc57fba2


From: Paul Eggert
Subject: grep branch, master, updated. v3.0-19-gc57fba2
Date: Wed, 31 May 2017 14:18:11 -0400 (EDT)

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".

The branch, master has been updated
       via  c57fba2604b4e1095e869de6c844d8c3d973814f (commit)
       via  754b2d36a1b98952df4074a2fa13035e40e14786 (commit)
      from  40b095bd2d069bcbec32f8919a9e891fc79c26f4 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=c57fba2604b4e1095e869de6c844d8c3d973814f


commit c57fba2604b4e1095e869de6c844d8c3d973814f
Author: Paul Eggert <address@hidden>
Date:   Wed May 31 11:17:27 2017 -0700

    Document grep performance
    
    * doc/grep.texi (Performance): New section.

diff --git a/doc/grep.texi b/doc/grep.texi
index 077c05f..728c039 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -64,6 +64,7 @@ This manual is for version @value{VERSION} of GNU Grep.
 * Invoking::                    Command-line options, environment, exit status.
 * Regular Expressions::         Regular Expressions.
 * Usage::                       Examples.
+* Performance::                 Performance tuning.
 * Reporting Bugs::              Reporting Bugs.
 * Copying::                     License terms for this manual.
 * Index::                       Combined index.
@@ -1789,6 +1790,93 @@ g/re/p
 @end enumerate
 
 
address@hidden Performance
address@hidden Performance
+
address@hidden performance
+Typically @command{grep} is an efficient way to search text.  However,
+it can be quite slow in some cases, and it can search large files
+where even minor performance tweaking can help significantly.
+Although the algorithm used by @command{grep} is an implementation
+detail that can change from release to release, understanding its
+basic strengths and weaknesses can help you improve its performance.
+
+The @command{grep} command operates partly via a set of automata that
+are designed for efficiency, and partly via a slower matcher that
+takes over when the fast matchers run into unusual features like
+back-references.  When feasible, the Boyer--Moore fast string
+searching algorithm is used to match a single fixed pattern, and the
+Aho--Corasick algorithm is used to match multiple fixed patterns.
+
address@hidden locales
+Generally speaking @command{grep} operates more efficiently in
+single-byte locales, since it can avoid the special processing needed
+for multi-byte characters.  If your pattern will work just as well
+that way, setting @env{LC_ALL} to a single-byte locale can help
+performance considerably.  Setting @samp{LC_ALL='C'} can be
+particularly efficient, as @command{grep} is tuned for that locale.
+
address@hidden case insensitive search
+Outside the @samp{C} locale, case-insensitive search, and search for
+bracket expressions like @samp{[a-z]} and @samp{[[=a=]b]}, can be
+suprisingly inefficient due to difficulties in fast portable access to
+concepts like multi-character collating elements.
+
address@hidden back-references
+A back-reference such as @samp{\1} can hurt performance significantly
+in some cases, since back-references cannot in general be implemented
+via a finite state automaton, and instead trigger a backtracking
+algorithm that can be quite inefficient.  For example, although the
+pattern @samp{^(.*)address@hidden@}(.*)address@hidden@}$} matches only lines 
whose
+lengths can be written as a sum @math{15x + 14y} for nonnegative
+integers @math{x} and @math{y}, the pattern matcher does not perform
+linear Diophantine analysis and instead backtracks through all
+possible matching strings, using an algorithm that is exponential in
+the worst case.
+
address@hidden holes in files
+On some operating systems that support files with holes---large
+regions of zeros that are not physically present on secondary
address@hidden can skip over the holes efficiently without
+needing to read the zeros.  This optimization is not available if the
address@hidden (@option{--text}) option is used (@pxref{File and
+Directory Selection}), unless the @option{-z} (@option{--null-data})
+option is also used (@pxref{Other Options}).
+
+For more about the algorithms used by @command{grep} and about
+related string matching algorithms, see:
+
address@hidden on
address@hidden @bullet
address@hidden
+Aho AV. Algorithms for finding patterns in strings.
+In: van Leeuwen J. @emph{Handbook of Theoretical Computer Science}, vol. A.
+New York: Elsevier; 1990. p. 255--300.
+This surveys classic string matching algorithms, some of which are
+used by @command{grep}.
+
address@hidden
+Aho AV, Corasick MJ. Efficient string matching: an aid to bibliographic search.
address@hidden 1975;18(6):333--40.
address@hidden://dx.doi.org/10.1145/360825.360855}.
+This introduces the Aho--Corasick algorithm.
+
address@hidden
+Boyer RS, Moore JS. A fast string searching algorithm.
address@hidden 1977;20(10):762--72.
address@hidden://dx.doi.org/10.1145/359842.359859}.
+This introduces the Boyer--Moore algorithm.
+
address@hidden
+Faro S, Lecroq T. The exact online string matching problem: a review
+of the most recent results.
address@hidden Comput Surv}. 2013;45(2):13.
address@hidden://dx.doi.org/10.1145/2431211.2431212}.
+This surveys string matching algorithms that might help improve the
+performance of @command{grep} in the future.
address@hidden itemize
address@hidden off
+
 @node Reporting Bugs
 @chapter Reporting bugs
 

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=754b2d36a1b98952df4074a2fa13035e40e14786


commit c57fba2604b4e1095e869de6c844d8c3d973814f
Author: Paul Eggert <address@hidden>
Date:   Wed May 31 11:17:27 2017 -0700

    Document grep performance
    
    * doc/grep.texi (Performance): New section.

diff --git a/doc/grep.texi b/doc/grep.texi
index 077c05f..728c039 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -64,6 +64,7 @@ This manual is for version @value{VERSION} of GNU Grep.
 * Invoking::                    Command-line options, environment, exit status.
 * Regular Expressions::         Regular Expressions.
 * Usage::                       Examples.
+* Performance::                 Performance tuning.
 * Reporting Bugs::              Reporting Bugs.
 * Copying::                     License terms for this manual.
 * Index::                       Combined index.
@@ -1789,6 +1790,93 @@ g/re/p
 @end enumerate
 
 
address@hidden Performance
address@hidden Performance
+
address@hidden performance
+Typically @command{grep} is an efficient way to search text.  However,
+it can be quite slow in some cases, and it can search large files
+where even minor performance tweaking can help significantly.
+Although the algorithm used by @command{grep} is an implementation
+detail that can change from release to release, understanding its
+basic strengths and weaknesses can help you improve its performance.
+
+The @command{grep} command operates partly via a set of automata that
+are designed for efficiency, and partly via a slower matcher that
+takes over when the fast matchers run into unusual features like
+back-references.  When feasible, the Boyer--Moore fast string
+searching algorithm is used to match a single fixed pattern, and the
+Aho--Corasick algorithm is used to match multiple fixed patterns.
+
address@hidden locales
+Generally speaking @command{grep} operates more efficiently in
+single-byte locales, since it can avoid the special processing needed
+for multi-byte characters.  If your pattern will work just as well
+that way, setting @env{LC_ALL} to a single-byte locale can help
+performance considerably.  Setting @samp{LC_ALL='C'} can be
+particularly efficient, as @command{grep} is tuned for that locale.
+
address@hidden case insensitive search
+Outside the @samp{C} locale, case-insensitive search, and search for
+bracket expressions like @samp{[a-z]} and @samp{[[=a=]b]}, can be
+suprisingly inefficient due to difficulties in fast portable access to
+concepts like multi-character collating elements.
+
address@hidden back-references
+A back-reference such as @samp{\1} can hurt performance significantly
+in some cases, since back-references cannot in general be implemented
+via a finite state automaton, and instead trigger a backtracking
+algorithm that can be quite inefficient.  For example, although the
+pattern @samp{^(.*)address@hidden@}(.*)address@hidden@}$} matches only lines 
whose
+lengths can be written as a sum @math{15x + 14y} for nonnegative
+integers @math{x} and @math{y}, the pattern matcher does not perform
+linear Diophantine analysis and instead backtracks through all
+possible matching strings, using an algorithm that is exponential in
+the worst case.
+
address@hidden holes in files
+On some operating systems that support files with holes---large
+regions of zeros that are not physically present on secondary
address@hidden can skip over the holes efficiently without
+needing to read the zeros.  This optimization is not available if the
address@hidden (@option{--text}) option is used (@pxref{File and
+Directory Selection}), unless the @option{-z} (@option{--null-data})
+option is also used (@pxref{Other Options}).
+
+For more about the algorithms used by @command{grep} and about
+related string matching algorithms, see:
+
address@hidden on
address@hidden @bullet
address@hidden
+Aho AV. Algorithms for finding patterns in strings.
+In: van Leeuwen J. @emph{Handbook of Theoretical Computer Science}, vol. A.
+New York: Elsevier; 1990. p. 255--300.
+This surveys classic string matching algorithms, some of which are
+used by @command{grep}.
+
address@hidden
+Aho AV, Corasick MJ. Efficient string matching: an aid to bibliographic search.
address@hidden 1975;18(6):333--40.
address@hidden://dx.doi.org/10.1145/360825.360855}.
+This introduces the Aho--Corasick algorithm.
+
address@hidden
+Boyer RS, Moore JS. A fast string searching algorithm.
address@hidden 1977;20(10):762--72.
address@hidden://dx.doi.org/10.1145/359842.359859}.
+This introduces the Boyer--Moore algorithm.
+
address@hidden
+Faro S, Lecroq T. The exact online string matching problem: a review
+of the most recent results.
address@hidden Comput Surv}. 2013;45(2):13.
address@hidden://dx.doi.org/10.1145/2431211.2431212}.
+This surveys string matching algorithms that might help improve the
+performance of @command{grep} in the future.
address@hidden itemize
address@hidden off
+
 @node Reporting Bugs
 @chapter Reporting bugs
 

-----------------------------------------------------------------------

Summary of changes:
 doc/grep.texi | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 gnulib        |  2 +-
 2 files changed, 89 insertions(+), 1 deletion(-)


hooks/post-receive
-- 
grep



reply via email to

[Prev in Thread] Current Thread [Next in Thread]