bug-grep
[Top][All Lists]

## bug#16966: [PATCH] grep: optimization with the superset of DFA

 From: Norihiro Tanaka Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA Date: Wed, 02 Apr 2014 22:02:01 +0900

```Paolo Bonzini wrote:
> Does anything change if there are a few million c's?

The superset of `a ANYCHAR b' is 'a CSET STAR b'.
It's DFA states are following.

s0: The position set is none.
s1: The position set is 1:a
s2: The position set is 1:a 2:CSET
s3: The position set is 1:a 2:CSET 3:b (accepted)

1  accccccccccccccccccccccccccccccccccccccc
^ s1

1  accccccccccccccccccccccccccccccccccccccc
^ s2

1  accccccccccccccccccccccccccccccccccccccc
^ s2

......

1,000,000  cccccccccccccccccccccccccccccccccccccccb
^ s2

1,000,000  cccccccccccccccccccccccccccccccccccccccb
^ s3 accepted

Then re-searched with the superset because matched on multi-lines.

1,000,000  cccccccccccccccccccccacccccccccccccccccb
^ s0

1,000,000  cccccccccccccccccccccacccccccccccccccccb
^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
^ s3 accepted

Then re-searched with the original DFA becuase matched on one-line.
The DFA states are following.

s0: The position set is none.
s1: The position set is 1:a
s2: The position set is 1:a 2:ANYCHAR
s3: The position set is 1:a 2:ANYCHAR 3:b (accepted)

1,000,000  cccccccccccccccccccccacccccccccccccccccb
^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
^ s0 rejected

Even if matched multi-line, It's still as fast and memory-required as
matched single-line, because in search-mode STAR doesn't count number
of NFA and DFA states up.  So for example, a set of DFA states for
`a CSET STAR b' is correspond with for `a CSET b'. (If my recognition is
right.)

OTOH, If CSET{1,mb_cur_max} is assigned for ANYCHAR, number of DFA
states will be greater than above, and I may be slightly slower
becuase of overheads to build DFA states.  Notice that `{m,n}'
http://www.gnu.org/software/grep/manual/grep.html#Reporting-Bugs).

Example, use the superset for following pattern in EUC-JP locale.
It's mb_cur_max == 3.

a.....b

If ANYCHAR is replaced to `CSET STAR', as following.

a CSET STAR CSET STAR CSET STAR CSET STAR CSET STAR CSET STAR b

I draw it below.

/\
/  \  2:CSET
\  /
\/
1:a ---> 3:b
(The figure drawn previously was wrong.)

It's equal to following.

a CSET STAR b

OTOH, the period is replaced to `CSET CSET CSET CAT OR CSET CSET CAT
CSET CAT OR', below.  Itt's complicated, and I don't want to draw it. :)

a CSET CSET CSET CAT OR CSET CSET CAT CSET CAT OR CSET CSET CSET
CAT OR CSET CSET CAT CSET CAT OR CAT CSET CSET CSET CAT OR CSET
CSET CAT CSET CAT OR CAT CSET CSET CSET CAT OR CSET CSET CAT CSET
CAT OR CAT b

Thanks,
Norihiro

```