[bug-grep] UTF-8 performance: progress report

From:	Tim Waugh
Subject:	[bug-grep] UTF-8 performance: progress report
Date:	Wed, 15 Dec 2004 13:29:55 +0000
User-agent:	Mutt/1.4.1i

Hi,

I have been working on improving grep's performance in the UTF-8
encoding, and thought I'd send a progress report.

Below are some simple benchmarking results comparing two binaries:

* grep-2.5.1 as released, configured with --without-included-regex

* grep-2.5.1-31.3, built for Fedora Core 3.  Several patches are
  applied, and it is configured with --without-included-regex.  Among
  the applied patches:

  o dfa-optional: this makes the use of the DFA conditional on whether
    the current locale character encoding is a multibyte one.  For
    UTF-8, the DFA is turned off.  I posted results of this
    improvement early last month.

  o egf-speedup: this considerably reduces the multibyte processing
    (mbrtowc etc) by doing it only when necessary.  For the special
    case of UTF-8 with the built-in DFA turned off, grep itself never
    needs to do it; of course the system re_search() function still
    has to be aware of multibyte handling.

Both run on the same machine, and the installed C library is
glibc-2.3.3-90 from the Fedora development repository.

Here is the simple test script I used:

==>
perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
echo "        ASCII:" > a
(export LANG=C; time $GREP 'foo' input) 2>&1 | grep user >> a
(export LANG=C; time $GREP '0.3' input) 2>&1 | grep user >> a
(export LANG=C; time $GREP -v '$' input) 2>&1 | grep user >> a
(export LANG=C; time $GREP -v '90123456789' input) 2>&1 | grep user >> a
echo "        UTF-8:" > b
(export LANG=en_GB.UTF-8; time $GREP 'foo' input) 2>&1 | grep user >> b
(export LANG=en_GB.UTF-8; time $GREP '0.3' input) 2>&1 | grep user >> b
(export LANG=en_GB.UTF-8; time $GREP -v '$' input) 2>&1 | grep user >> b
(export LANG=en_GB.UTF-8; time $GREP -v '90123456789' input) 2>&1 | grep user >> b
paste <(expand a) <(expand b)
<==
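
The script above relies on LANG=en_GB.UTF-8 actually taking effect.  As
a rough sketch (not part of the original script), the check below asks
"locale charmap" what encoding a command would see under a given LANG.
The helper names charmap_for and is_utf8_locale are made up for
illustration.  The assumption is that if the locale is not installed,
or if LC_ALL or LC_CTYPE in the caller's environment overrides LANG,
the UTF-8 runs would most likely execute in a non-UTF-8 locale and the
second column would not measure what it claims to.

```shell
# Report the character encoding a command would see under a given LANG,
# set the same way the benchmark sets it.  LC_ALL and LC_CTYPE are left
# untouched, so an override in the caller's environment shows up here too.
charmap_for() {
    (export LANG="$1"; locale charmap 2>/dev/null)
}

# Succeed only if the given LANG really selects a UTF-8 encoding.
is_utf8_locale() {
    case "$(charmap_for "$1")" in
        UTF-8|utf-8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_locale en_GB.UTF-8; then
    echo "en_GB.UTF-8 is in effect: the UTF-8 timings are meaningful"
else
    echo "en_GB.UTF-8 is not in effect: check the locale before trusting the UTF-8 column"
fi
```

Running this once before the benchmark should be enough; it does not
touch the input file or the result files a and b.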

First the results for grep-2.5.1 as released:

        ASCII:          UTF-8:
user    0m0.125s        user    0m9.460s
user    0m0.554s        user    0m25.188s
user    0m2.464s        user    39m26.313s
user    0m0.293s        user    35m55.760s

Now the much-improved results for grep-2.5.1-31.3:

        ASCII:          UTF-8:
user    0m0.123s        user    0m0.126s
user    0m0.564s        user    0m13.152s
user    0m2.500s        user    0m12.179s
user    0m0.293s        user    0m0.291s
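
To put a number on "much-improved", the UTF-8 timings from the two
tables can be turned into ratios.  This is only a sketch: the helper
names to_seconds and speedup are made up for illustration, and the
figures fed in are copied from the tables above.

```shell
# Convert a bash `time` value such as "39m26.313s" to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, parts, "m")       # parts[1] = minutes, parts[2] = "26.313s"
        sub(/s$/, "", parts[2])    # drop the trailing "s"
        printf "%.3f\n", parts[1] * 60 + parts[2]
    }'
}

# Ratio of two timings (before / after), to one decimal place.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

# UTF-8 column: grep-2.5.1 as released vs grep-2.5.1-31.3, one line per test.
speedup 0m9.460s   0m0.126s
speedup 0m25.188s  0m13.152s
speedup 39m26.313s 0m12.179s
speedup 35m55.760s 0m0.291s
```

The ratios are only as good as single-run "user" times allow, so the
smaller ones in particular should be read as rough indications.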

For the last test, the UTF-8 run appears marginally faster than the
ASCII run (0.291s against 0.293s).  This shows that for that pattern,
whatever overhead UTF-8 may incur is lost in the noise.

You can see the patches that are applied in grep-2.5.1-31.3 here:

  ftp://people.redhat.com/twaugh/tmp/grep/fc3/unpacked/

Tim.