help-flex
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Scanner takes a long time to build.


From: Scott A Crosby
Subject: Re: Scanner takes a long time to build.
Date: Thu, 25 Jul 2002 17:50:41 -0400 (EDT)

On Thu, 25 Jul 2002, John Millaway wrote:

> > Paragraph\ .a.{0,10}2.{0,10}C.\ of\ S.\ 1618
>
> That last regex is enough to blow up the DFA. It's not the number of regular
> expressions killing your flex file, it's the patterns. Have a look:
>

I suspected it was the non-determinism and backing up of '.', but I'd
forgotten that you could simplify the DFA's immensily by doing
    [^X].{,10}X###
so that it doesn't have to worry about non-determinism of which 'X' in
(say 'XXyyXXyyXX') indicates its the time to go on to try to match the ###
part of the regexp.

Thanks for reminding me of the trick.

> Paragraph\ .a[^2\n]{0,10}2[^C\n]{0,10}C.\ of\ S.\ 1618  ==> [89]
>
> I replaced the first . with [^2\n], and the second . with [^C\n]. This may not
> be what you want, but it shows the drastic reduction in DFA size. You'll have
> to rethink your regex's. Refer to the O'Reilly book , "Mastering Regular
> Expressions". This subject is addressed.

I was prototyping how well flex could run the rules in Spamassassin. So, I
wrote a quick script to rewrite the regexp's in its rulebase into
something flex could handle. (skipping the regexp's that couldn't be
easily mechanically transformed.)

I've updated the my program to now include your suggestion. It works
*MUCH* better now, but this change to the regexp's alters the semantics
too much. (for example:  'copyright[^a]{0,100}all\ rights\ reserved'
means much different than 'copyright.{0,100}all\ rights\ reserved')

Thanks for your help, though I've concluded that its not practical to use
flex to speed up spamassassin.

Scott




reply via email to

[Prev in Thread] Current Thread [Next in Thread]