bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x)


From: Jim Meyering
Subject: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Wed, 19 Feb 2014 19:44:59 -0800

Hmm... it's not as clear-cut as I first thought.
(I built 2.17 plus the above patch and installed it in a directory named
grep-2.18, so "18" in the output below means 2.17+patch.)

The following times 2.16, 2.17 and 2.17+patch two ways:

$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time \
    /p/p/grep-2.$i/bin/grep -i foobar k; done
16
       15.96 real        14.57 user         0.12 sys
17
        1.13 real         1.07 user         0.06 sys
18
        1.96 real         1.89 user         0.06 sys

The above search takes more than 70% longer with the proposed patch than
with unpatched 2.17 (1.96s vs. 1.13s real).
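
For anyone who wants to re-derive that percentage from the "real" times, here
is a minimal sketch. The helper name pct_slower is made up for illustration; it
is not part of grep or of the patch.

```shell
# pct_slower BASE CUR: print how much longer CUR took than BASE, as a
# whole-number percentage. Hypothetical helper, for illustration only.
pct_slower() {
  awk -v base="$1" -v cur="$2" \
    'BEGIN { printf "%.0f\n", (cur / base - 1) * 100 }'
}

# 2.17 (1.13s) vs. 2.17+patch (1.96s), "real" times from the run above.
pct_slower 1.13 1.96
```

With the numbers above this comes out to roughly 73, consistent with "more
than 70% longer".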

Contrast that with performance in the non-UTF8 ja_JP.eucJP locale:

$ yes $(printf '%078dm' 0)|head -10000 > in
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time \
    /p/p/grep-2.$i/bin/grep -i n in; done
16
        0.03 real         0.02 user         0.00 sys
17
        2.98 real         2.96 user         0.00 sys
18
        0.02 real         0.02 user         0.00 sys

Using the jjj+foobar example, but with only 100k lines, we see there
was a nearly 200x performance regression going from grep-2.16 to 2.17
in that locale:

$ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -100000 > k
$ for i in 16 17 18; do echo $i; env LC_ALL=ja_JP.eucJP time \
    /p/p/grep-2.$i/bin/grep -i foobar k; done
16
        0.15 real         0.14 user         0.00 sys
17
       27.74 real        27.72 user         0.01 sys
18
        0.11 real         0.11 user         0.00 sys
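
The slowdown factor can be checked the same way from the "real" column. This
is only a sketch, and slowdown_factor is a hypothetical helper name, not
something shipped with grep.

```shell
# slowdown_factor FAST SLOW: print SLOW/FAST rounded to a whole number.
# Hypothetical helper, for illustration only.
slowdown_factor() {
  awk -v fast="$1" -v slow="$2" \
    'BEGIN { printf "%.0f\n", slow / fast }'
}

# grep-2.16 (0.15s) vs. grep-2.17 (27.74s) in ja_JP.eucJP, from the run above.
slowdown_factor 0.15 27.74
```

For these two timings the ratio is about 185, which is where the round
"200x" figure comes from.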

Obviously, I want to retain all of 2.17's performance gain in UTF-8 locales,
while avoiding the 200x penalty in multi-byte non-UTF8 locales like ja_JP.eucJP.
So I have prepared a better patch.
With the two attached commits (on top of 2.17), I get these timings,
i.e., the same 200x improvement with ja_JP.eucJP, and no regression
with en_US.UTF-8:

$ for i in 16 17 18; do printf "$i: "; env LC_ALL=ja_JP.eucJP time \
    /p/p/grep-2.$i/bin/grep -i foobar k; done
16:         0.14 real         0.14 user         0.00 sys
17:        27.97 real        27.95 user         0.01 sys
18:         0.12 real         0.12 user         0.00 sys

$ for i in 16 17 18; do printf "$i: "; env LC_ALL=en_US.UTF-8 time \
    /p/p/grep-2.$i/bin/grep -i foobar k; done
16:         0.13 real         0.12 user         0.00 sys
17:         0.01 real         0.01 user         0.00 sys
18:         0.01 real         0.01 user         0.00 sys

Attachment: k.txt
Description: Text document

