bug-grep


From: Norihiro Tanaka
Subject: bug#23932: dfa: use algorithm for single byte character to any single byte character in input text always
Date: Sun, 10 Jul 2016 18:51:43 +0900

In multibyte locales, if a pattern starts with a period expression,
matching is still slow, because the transition table is built at run
time even when the next character in the input text is single-byte.

This patch makes dfa always use the single-byte algorithm for any
single-byte character in the input text.  If the transition table has
already been built and the next character in the input text is
single-byte, dfa moves to the next state by looking up only the
pre-built transition table, even from a state that includes ANYCHAR.
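
The idea can be illustrated with a small sketch.  This is not the code
from the patch or from dfa.c; it is a toy model under stated
assumptions: the pattern is fixed to `.b', every byte below 0x80 is a
single-byte character, and every byte at or above 0x80 starts a
two-byte character.  The names build_table, step_multibyte and
matches_dot_b are hypothetical.

```c
#include <stddef.h>

/* Toy DFA for the pattern `.b' (search, not anchored).
   State 0: no character seen yet.
   State 1: at least one character seen.
   State 2: accepted.  */
enum { NSTATES = 3, ACCEPT = 2, NEEDS_SLOW_PATH = -1 };

static int trans[NSTATES][256];

/* Pre-build the transition table.  Entries for bytes that start a
   multibyte character (assumed here: any byte >= 0x80) are left
   unresolved so the caller falls back to the slow path.  */
void build_table(void)
{
    for (int c = 0; c < 256; c++) {
        int single = c < 0x80;
        trans[0][c] = single ? 1 : NEEDS_SLOW_PATH;
        trans[1][c] = single ? (c == 'b' ? ACCEPT : 1) : NEEDS_SLOW_PATH;
        trans[ACCEPT][c] = ACCEPT;
    }
}

/* Slow path: consume one multibyte character (assumed two bytes long)
   and treat it as a match for ANYCHAR.  */
int step_multibyte(int state, const unsigned char **p,
                   const unsigned char *end)
{
    *p += (end - *p >= 2) ? 2 : 1;
    return state == ACCEPT ? ACCEPT : 1;
}

/* Return 1 if the buffer matches `.b'.  *slow_steps counts how often
   the slow path was needed; single-byte input never needs it, even
   though states 0 and 1 both contain ANYCHAR.  */
int matches_dot_b(const char *s, size_t len, int *slow_steps)
{
    const unsigned char *p = (const unsigned char *) s;
    const unsigned char *end = p + len;
    int state = 0;

    *slow_steps = 0;
    while (p < end && state != ACCEPT) {
        int next = trans[state][*p];
        if (next != NEEDS_SLOW_PATH) {
            state = next;       /* fast path: table lookup only */
            p++;
        } else {
            state = step_multibyte(state, &p, end);
            ++*slow_steps;
        }
    }
    return state == ACCEPT;
}
```

After calling build_table(), an all-ASCII buffer such as "ab" is
matched with zero slow-path steps, while a buffer that begins with a
two-byte character takes the slow path only for that character.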

$ yes "$(printf 'a%038db\n' 0)" | head -1000000 >in
$ env LC_ALL=C gcc -v
Reading specs from /usr/local/lib/gcc/x86_64-pc-linux-gnu/4.4.7/specs
Target: x86_64-pc-linux-gnu
Configured with: ./configure --with-as=/usr/local/bin/as 
--with-ld=/usr/local/bin/ld --with-system-zlib --enable-__cxa_atexit
Thread model: posix
gcc version 4.4.7 (GCC)

patch#21486 must be applied before this patch.  With this patch, grep
speeds up further.

[grep-2.25]
$ time -p env LC_ALL=ja_JP.eucjp grep .a.b in
real 4.78
user 4.42
sys 0.16
$ time -p env LC_ALL=ja_JP.eucjp grep '.\{41\}' in
real 46.23
user 43.98
sys 0.21

[after patch#21486]
$ time -p env LC_ALL=ja_JP.eucjp src/grep .a.b in
real 1.26
user 1.08
sys 0.08
$ time -p env LC_ALL=ja_JP.eucjp src/grep '.\{41\}' in
real 1.14
user 1.00
sys 0.10

[after this patch]
$ time -p env LC_ALL=ja_JP.eucjp src/grep .a.b in
real 0.47
user 0.36
sys 0.07
$ time -p env LC_ALL=ja_JP.eucjp src/grep '.\{41\}' in
real 0.24
user 0.18
sys 0.05

[locale C (ref.)]
$ time -p env LC_ALL=C src/grep .a.b in
real 0.23
user 0.11
sys 0.09
$ time -p env LC_ALL=C src/grep '.\{41\}' in
real 0.22
user 0.13
sys 0.06

Attachment: 0001-dfa-use-algorithm-for-single-byte-character-to-any-s.patch
Description: Text document

