[Grep-devel] [PATCH 2/2] Document grep performance

grep-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Grep-devel] [PATCH 2/2] Document grep performance

From:	Paul Eggert
Subject:	[Grep-devel] [PATCH 2/2] Document grep performance
Date:	Wed, 31 May 2017 11:19:47 -0700

* doc/grep.texi (Performance): New section.
---
 doc/grep.texi | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/doc/grep.texi b/doc/grep.texi
index 077c05f..728c039 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -64,6 +64,7 @@ This manual is for version @value{VERSION} of GNU Grep.
 * Invoking::                    Command-line options, environment, exit status.
 * Regular Expressions::         Regular Expressions.
 * Usage::                       Examples.
+* Performance::                 Performance tuning.
 * Reporting Bugs::              Reporting Bugs.
 * Copying::                     License terms for this manual.
 * Index::                       Combined index.
@@ -1789,6 +1790,93 @@ g/re/p
 @end enumerate
 
 
address@hidden Performance
address@hidden Performance
+
address@hidden performance
+Typically @command{grep} is an efficient way to search text.  However,
+it can be quite slow in some cases, and it can search large files
+where even minor performance tweaking can help significantly.
+Although the algorithm used by @command{grep} is an implementation
+detail that can change from release to release, understanding its
+basic strengths and weaknesses can help you improve its performance.
+
+The @command{grep} command operates partly via a set of automata that
+are designed for efficiency, and partly via a slower matcher that
+takes over when the fast matchers run into unusual features like
+back-references.  When feasible, the Boyer--Moore fast string
+searching algorithm is used to match a single fixed pattern, and the
+Aho--Corasick algorithm is used to match multiple fixed patterns.
+
address@hidden locales
+Generally speaking @command{grep} operates more efficiently in
+single-byte locales, since it can avoid the special processing needed
+for multi-byte characters.  If your pattern will work just as well
+that way, setting @env{LC_ALL} to a single-byte locale can help
+performance considerably.  Setting @samp{LC_ALL='C'} can be
+particularly efficient, as @command{grep} is tuned for that locale.
+
address@hidden case insensitive search
+Outside the @samp{C} locale, case-insensitive search, and search for
+bracket expressions like @samp{[a-z]} and @samp{[[=a=]b]}, can be
+suprisingly inefficient due to difficulties in fast portable access to
+concepts like multi-character collating elements.
+
address@hidden back-references
+A back-reference such as @samp{\1} can hurt performance significantly
+in some cases, since back-references cannot in general be implemented
+via a finite state automaton, and instead trigger a backtracking
+algorithm that can be quite inefficient.  For example, although the
+pattern @samp{^(.*)address@hidden@}(.*)address@hidden@}$} matches only lines 
whose
+lengths can be written as a sum @math{15x + 14y} for nonnegative
+integers @math{x} and @math{y}, the pattern matcher does not perform
+linear Diophantine analysis and instead backtracks through all
+possible matching strings, using an algorithm that is exponential in
+the worst case.
+
address@hidden holes in files
+On some operating systems that support files with holes---large
+regions of zeros that are not physically present on secondary
address@hidden can skip over the holes efficiently without
+needing to read the zeros.  This optimization is not available if the
address@hidden (@option{--text}) option is used (@pxref{File and
+Directory Selection}), unless the @option{-z} (@option{--null-data})
+option is also used (@pxref{Other Options}).
+
+For more about the algorithms used by @command{grep} and about
+related string matching algorithms, see:
+
address@hidden on
address@hidden @bullet
address@hidden
+Aho AV. Algorithms for finding patterns in strings.
+In: van Leeuwen J. @emph{Handbook of Theoretical Computer Science}, vol. A.
+New York: Elsevier; 1990. p. 255--300.
+This surveys classic string matching algorithms, some of which are
+used by @command{grep}.
+
address@hidden
+Aho AV, Corasick MJ. Efficient string matching: an aid to bibliographic search.
address@hidden 1975;18(6):333--40.
address@hidden://dx.doi.org/10.1145/360825.360855}.
+This introduces the Aho--Corasick algorithm.
+
address@hidden
+Boyer RS, Moore JS. A fast string searching algorithm.
address@hidden 1977;20(10):762--72.
address@hidden://dx.doi.org/10.1145/359842.359859}.
+This introduces the Boyer--Moore algorithm.
+
address@hidden
+Faro S, Lecroq T. The exact online string matching problem: a review
+of the most recent results.
address@hidden Comput Surv}. 2013;45(2):13.
address@hidden://dx.doi.org/10.1145/2431211.2431212}.
+This surveys string matching algorithms that might help improve the
+performance of @command{grep} in the future.
address@hidden itemize
address@hidden off
+
 @node Reporting Bugs
 @chapter Reporting bugs
 
-- 
2.9.4

[Prev in Thread]

Current Thread

[Next in Thread]

[Grep-devel] [PATCH 1/2] build: update gnulib submodule to latest, Paul Eggert, 2017/05/31
- [Grep-devel] [PATCH 2/2] Document grep performance, Paul Eggert <=

Prev by Date: [Grep-devel] [PATCH 1/2] build: update gnulib submodule to latest
Previous by thread: [Grep-devel] [PATCH 1/2] build: update gnulib submodule to latest
Index(es):
- Date
- Thread