help-flex
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Scanner takes a long time to build.


From: Vern Paxson
Subject: Re: Scanner takes a long time to build.
Date: Sat, 27 Jul 2002 15:08:40 -0700

> The way I interpret it, the algorithm builds the DFA states as the parsing
> goes along

Flex builds NFA states as it parses the rules.  It then converts the entire
set into a DFA and spits out a set of tables representing the DFA.

So it could instead spit out tables representing the NFA, along with
code that would compile the NFA into a DFA at run-time, as needed.

> grammar to be dynamic: The DFA state translation mechanism needs to be
> incorporated, but the translation could be from say some NFA states one has
> derived from the grammar.

Yes.

> One could also think of Flex generating NFA/DFA hybrids: I just read in
> comp.compilers that one should use a NFA if one needs to keep track of
> parsing position relative the grammar, like when the patterns has groupings
> that should be extracted for substitution. For example, a pattern
>   "begin"[.*]"end"  { yylval.text = \1; return GROUPING; }
> where \1 would refer to the matched text between "begin" and "end" then.

I don't believe NFAs actually help with this.  To find non-trivial groupings
you need additional scan-time information.  (The above example is trivial,
since you know it starts 5 characters from the beginning and ends 3 from
the end, but if either is variable-length and can also match text in the group,
then it gets tricky.)

> >My main current project these days is Bro, a network intrusion detection
> >system.
> 
> So it isn't "Big Bro" then. :-)

That's why I chose the name, actually - as a reminder that such tools
have major privacy implications.

> It is interesting though that the other fellow wanting the feature was
> writing on an anti-spam tool. So it could happen that these "natural
> language pattern recognition" tools require grammars that may blow up
> exponentially in a straightforward DFA translation.

Two points.  First, the main application where this came up for me is
simply looking for fixed strings (but a whole lot of them) in packets or
byte streams - not a language problem at all, there's no syntax.  Second,
this is actually not the essence of good network intrusion detection, IMHO;
instead, one wants to focus on understanding the structure of the application
(so that *is* a language problem, but for a formal language, not a natural
language), and on larger patterns of activity.  (Bro is "event based" rather
than "signature based", and has mechanisms for dealing with traffic in terms
of events rather than just packet payloads.)  It's the signatured-based
approaches that need to look for the zillions of strings.  But I'm adding more
signature-matching capabilities to Bro, as a way to leverage the large
signature libraries developed for the most popular freeware signature-based
system, Snort.

                Vern



reply via email to

[Prev in Thread] Current Thread [Next in Thread]