bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug #13859] matching '([A]|[B]){2}' in different locales


From: Tim Waugh
Subject: Re: [bug #13859] matching '([A]|[B]){2}' in different locales
Date: Thu, 21 Jul 2005 10:16:40 +0100
User-agent: Mutt/1.4.2.1i

On Wed, Jul 20, 2005 at 05:51:14PM -0400, Charles Levert wrote:

> Does the answer depend on the chosen locale?
> Is there still a justification for grep's DFA?

The answer may well depend on the locale, and there may indeed still
be justification for grep's own DFA.  Here are the results of some
timings I did a while ago and posted to the old grep-devel list:

Date: Fri, 5 Nov 2004 18:14:04 +0000
Message-ID: <address@hidden>
Subject: GREP_NO_DFA environment variable
From: Tim Waugh <address@hidden>
To: grep-devel-list <address@hidden>

--hi1pl0KEDnHSGXJ/
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi,

Following Aharon's lead I've made a patch for grep to make it avoid
using dfaexec when the GREP_NO_DFA environment variable is set.  This
is to facilitate more accurate benchmarking of the effect of DFA on a
variety of searches.

I've tried some tests, and have drawn a conclusion: scroll down for
my suggestion as to how to make grep fast in all cases.

C locale
--------

For some tests I find no measureable advantage to DFA:

$ export LC_CTYPE=C
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep foo input) 2>&1 | grep user
user    0m0.126s
$ (export GREP_NO_DFA=1 ; time ./grep foo input) 2>&1 | grep user
user    0m0.132s

(The results for each test varied by over 0.1s.)

For other tests, the DFA seemed to offer a significant advantage:

$ export LC_CTYPE=C
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep 0.3 input) 2>&1 | grep user
user    0m0.571s
$ (export GREP_NO_DFA=1; time ./grep 0.3 input) 2>&1 | grep user
user    0m9.342s
$ (time ./grep -v $ input) 2>&1 | grep user
user    0m2.619s
$ (export GREP_NO_DFA=1; time ./grep -v $ input) 2>&1 | grep user
user    0m11.937s

For these three types of search ("foo", "0.3", and "$"), DFA
definitely seems to be a win -- in the C locale.

UTF-8 locales
-------------

When using UTF-8 it is an entirely different matter, to put it
mildly.  Although searches in which the pattern is rare do not show
much difference in running times:

$ export LC_CTYPE=en_GB.UTF-8
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep foo input) 2>&1 | grep user
user    0m0.132s
$ (export GREP_NO_DFA=1 ; time ./grep foo input) 2>&1 | grep user
user    0m0.132s

searches with common patterns are adversely affected by DFA:

$ export LC_CTYPE=en_GB.UTF-8
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep 0.3 input) 2>&1 | grep user
user    0m15.713s
$ (export GREP_NO_DFA=1 ; time ./grep 0.3 input) 2>&1 | grep user
user    0m12.542s

Sometimes wildly:

$ export LC_CTYPE=en_GB.UTF-8
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep -v $ input) 2>&1 | grep user
user    40m4.108s
$ (export GREP_NO_DFA=1 ; time ./grep -v $ input) 2>&1 | grep user
user    0m11.958s

CONCLUSION
==========

My suggestion is: let's turn off DFA when MB_CUR_MAX > 1.  While it
speeds up ASCII searching by quite some measure, it slows down UTF-8
searching (for example) by very much more for no appreciable gain.

The source code used for these tests is the Fedora Core RPM
grep-2.5.1-35, available at:

  ftp://people.redhat.com/twaugh/tmp/grep/

Tim.
*/

--hi1pl0KEDnHSGXJ/
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFBi8LsHU/d4jnpWe0RAgGkAJ4tv95XkTVISby0HZpLV5MXEWEvPACaAqUg
US0gOeHnl0d/Q0rBeNxgy9A=
=tzH1
-----END PGP SIGNATURE-----

--hi1pl0KEDnHSGXJ/--





reply via email to

[Prev in Thread] Current Thread [Next in Thread]