--- Begin Message ---
Subject: [PATCH] grep: try fgrep matcher for case insensitive matching by grep -F in multibyte locale
Date: Sun, 12 Jun 2016 18:47:58 +0900
In grep 2.19 and later, grep -F uses the grep matcher for case-insensitive
matching in multibyte locales. However, this performs poorly for a long
pattern because of the cost of building the DFA.
With this patch, in a multibyte locale, if a pattern consists only of
single-byte characters, all of their case counterparts are also single-byte
characters, and the pattern contains no invalid sequences, grep -F uses the
fgrep matcher, the same as in a single-byte locale.
This partially fixes bug#21763 and bug#22239.
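The eligibility condition described above can be sketched roughly as follows. This is only an illustration built on standard C library calls, not the code from the attached patch; the names fgrep_icase_candidate and encodes_as_single_byte are hypothetical, and using towupper/towlower as the "case counterparts" is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: true if WC encodes to exactly one byte
   in the current locale.  */
static bool
encodes_as_single_byte (wchar_t wc)
{
  char buf[MB_LEN_MAX];
  mbstate_t st;
  memset (&st, 0, sizeof st);
  return wcrtomb (buf, wc, &st) == 1;
}

/* Hypothetical check: true if every character of PAT (LEN bytes) is a
   valid single-byte character whose upper- and lower-case counterparts
   are also single-byte in the current locale.  */
bool
fgrep_icase_candidate (char const *pat, size_t len)
{
  assert (pat != NULL || len == 0);

  mbstate_t st;
  memset (&st, 0, sizeof st);

  for (size_t i = 0; i < len; )
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, pat + i, len - i, &st);
      if (n == 0)
        n = 1;                  /* A NUL byte decodes as one byte.  */
      if (n != 1)
        return false;           /* Multibyte, invalid, or incomplete.  */
      if (!encodes_as_single_byte ((wchar_t) towupper ((wint_t) wc))
          || !encodes_as_single_byte ((wchar_t) towlower ((wint_t) wc)))
        return false;           /* A case counterpart is multibyte.  */
      i += n;
    }
  return true;
}
```

A caller would normally run setlocale (LC_ALL, "") first so that the check reflects the user's locale. Under ja_JP.eucjp, a pattern containing the bytes \xb3\xa4 would be expected to fail this check, which is consistent with the slow case shown at the end of this message.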
$ seq -f '%g bottles of beer on the wall' 1 600 >pat
$ tr a-z A-Z <pat >in
(before)
$ time -p env LC_ALL=C src/grep -Fivf pat in
real 0.08
user 0.03
sys 0.03
$ time -p env LC_ALL=ja_JP.eucjp src/grep -Fivf pat in
real 104.84
user 93.32
sys 3.28
(after)
$ time -p env LC_ALL=C src/grep -Fivf pat in
real 0.09
user 0.03
sys 0.04
$ time -p env LC_ALL=ja_JP.eucjp src/grep -Fivf pat in
real 0.08
user 0.03
sys 0.03
If a pattern has any multibyte character, grep -F is still slow.
$ printf '\xb3\xa4\n' >>pat
$ time -p env LC_ALL=ja_JP.eucjp src/grep -Fivf pat in
real 103.38
user 93.81
sys 2.46
0001-grep-try-fgrep-matcher-for-case-insensitive-matching.patch
Description: Text document
--- End Message ---
--- Begin Message ---
Subject: Re: bug#23752: [PATCH] grep: try fgrep matcher for case insensitive matching by grep -F in multibyte locale
Date: Thu, 1 Sep 2016 09:50:11 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0
Thanks for that performance improvement. I rebased the patch (1st attachment),
wrote some follow-up changes (2nd attachment), and installed both into the
Savannah master.
> If a pattern has any multibyte character, grep -F is still slow.
Suppose all the multibyte characters in the pattern are non-letters, so that
case-folding does not affect them. Could grep -iF be fast in that case?
Is the problem that some encodings allow two different representations for the
same character, and we want the pattern to match both representations?
0001-grep-speed-up-iF-in-multibyte-locales.txt
Description: Text document
0002-grep-avoid-code-duplication-with-iF.txt
Description: Text document
--- End Message ---