bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bug#40634: Massive pattern list handling with -E format seems very s


From: Jim Meyering
Subject: Re: bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.
Date: Fri, 11 Sep 2020 15:17:29 +0200

On Fri, Sep 11, 2020 at 2:47 PM Jim Meyering <jim@meyering.net> wrote:
> On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > On Sun, 19 Apr 2020 07:41:49 +0900
> > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > > On Sat, 18 Apr 2020 00:22:26 +0900
> > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > >
> > > >
> > > > On Fri, 17 Apr 2020 10:24:42 +0900
> > > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > > >
> > > > >
> > > > > On Fri, 17 Apr 2020 09:35:36 +0900
> > > > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > > > >
> > > > > >
> > > > > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > > > > Paul Eggert <eggert@cs.ucla.edu> wrote:
> > > > > >
> > > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > > > >
> > > > > > > > I have had no idea to solve the problem yet.  If we revert it, 
> > > > > > > > bug#33357
> > > > > > > > will come back.
> > > > > > >
> > > > > > > Yes, I'd rather not revert if we can help it.
> > > > > > >
> > > > > > > My own thought was to not analyze the regular expression if we 
> > > > > > > discover that the input is empty. :-)
> > > > > >
> > > > > > Now, I have a idea, it is that we build indexes of epsilon nodes
> > > > > > including in follows before remove epsilon nodes.
> > > > >
> > > > >
> > > > > I wrote fix for the bug, but it will be slower then at grep 2.27 yet.
> > > >
> > > > It was improved previous patch.
> > >
> > > Sorry, correct patch is here.
> >
> > I made the previous patch even simpler.
> >
> > before:
> >
> > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> > real 7.24
> > user 7.14
> > sys 0.09
> >
> > after:
> >
> > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> > real 0.62
> > user 0.52
> > sys 0.10
>
> Thank you for this patch. I have rebased and made minor syntactic changes.
> I'll push it to gnulib soon, if not today, then by Monday.
>
> I am considering creating a test case in grep, but it feels too tight
> to be feasible: I would use a relative perf test, requiring that a
> passing test incur a perf cost of less than say 100x. Here's the
> beginnings of my attempt (note: this is just an outline -- obviously
> would not rely on having "time" in path or as a shell builtin):
>
> gen()
> {
>   local n=$1
>   local i=1
>   while : ; do
>     local pat=$(printf $i | sha1sum | cut -d' ' -f1)
>     printf '%s\n' "$pat$pat(\$|$pat)"
>     i=$(expr $i + 1)
>     test $i = $n && break
>   done
> }
>
> gen 4000 > pats-4000
> head -400 pats-4000 > pats-400
>
> # With fixed code, that a 10x input size increase (n=400 to 4000)
> # induces a 40x runtime increase: .05 -> 2.0s
> # Just prior to this change, it's 150x: 0.2 -> 30s
>
> env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null
> env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null

And here is the adjusted patch:

Attachment: dfa.c-epsilon-node-removal-speedup.patch
Description: Binary data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]