help-flex

From: Hans Aberg
Subject: Re: Scanner takes a long time to build.
Date: Sun, 28 Jul 2002 11:26:27 +0200

At 15:08 -0700 2002/07/27, Vern Paxson wrote:
>> The way I interpret it, the algorithm builds the DFA states as the parsing
>> goes along
>
>Flex builds NFA states as it parses the rules.  It then converts the entire
>set into a DFA and spits out a set of tables representing the DFA.

That's also how the algorithm is described in Aho et al.
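
For readers who do not have Aho et al. at hand, the core of that subset
construction is small; here is a minimal C sketch (the names, the bitset
representation, and the 64-state limit are my own illustration, not
flex's actual code):

  #include <stdint.h>

  #define NSTATES 64              /* toy limit: NFA sets fit one word  */
  #define NSYMS   256

  typedef uint64_t StateSet;      /* bit i set => NFA state i present  */

  /* Illustrative NFA tables: eps[i] = epsilon successors of state i,
     delta[i][c] = successors of state i on input byte c.              */
  static StateSet eps[NSTATES];
  static StateSet delta[NSTATES][NSYMS];

  /* epsilon-closure: add epsilon successors until the set is stable.  */
  static StateSet closure(StateSet s)
  {
      StateSet prev;
      do {
          prev = s;
          for (int i = 0; i < NSTATES; i++)
              if (s & ((StateSet)1 << i))
                  s |= eps[i];
      } while (s != prev);
      return s;
  }

  /* move: every state reachable from the set s on input byte c.       */
  static StateSet move(StateSet s, int c)
  {
      StateSet t = 0;
      for (int i = 0; i < NSTATES; i++)
          if (s & ((StateSet)1 << i))
              t |= delta[i][c];
      return t;
  }

  /* Subset construction: each distinct closure() becomes a DFA state. */
  #define MAXDFA 1024
  static StateSet dfa_set[MAXDFA];         /* NFA set behind DFA state */
  static int      dfa_next[MAXDFA][NSYMS]; /* the emitted DFA tables   */
  static int      ndfa;

  static int dfa_state(StateSet s)         /* intern a set as a state  */
  {
      for (int i = 0; i < ndfa; i++)
          if (dfa_set[i] == s) return i;
      dfa_set[ndfa] = s;
      return ndfa++;
  }

  static void build_dfa(StateSet start)
  {
      ndfa = 0;
      dfa_state(closure(start));
      for (int i = 0; i < ndfa; i++)       /* i doubles as a worklist  */
          for (int c = 0; c < NSYMS; c++)
              dfa_next[i][c] = dfa_state(closure(move(dfa_set[i], c)));
  }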

One can also do the same thing for LR(1) parsing; I think this is
described in the online Parsing Techniques book at
<http://www.cs.vu.nl/~dick/PTAPG.html>.

>So it could instead spit out tables representing the NFA, along with
>code that would compile the NFA into a DFA at run-time, as needed.

Yes, that is how I figured it.
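
Spelled out, that lazy scheme might look like the following (again only
a sketch of the idea, reusing closure(), move() and dfa_state() from
the previous fragment; I am assuming dfa_state() fills a freshly
interned state's dfa_next row with -1):

  /* Build DFA transitions only when the input first needs them.      */
  static int next_state(int d, int c)
  {
      if (dfa_next[d][c] < 0)              /* not constructed yet?     */
          dfa_next[d][c] = dfa_state(closure(move(dfa_set[d], c)));
      return dfa_next[d][c];
  }

  static int scan(const unsigned char *p, StateSet start)
  {
      int d = dfa_state(closure(start));   /* start state, interned    */
      while (*p)
          d = next_state(d, *p++);
      return d;       /* the caller checks whether d is accepting      */
  }

This way only the DFA states actually exercised by the input are ever
constructed, at the cost of each table lookup possibly faulting into
the subset computation.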

>> One could also think of Flex generating NFA/DFA hybrids: I just read in
>> comp.compilers that one should use an NFA if one needs to keep track
>> of the parsing position relative to the grammar, like when the
>> patterns have groupings that should be extracted for substitution.
>> For example, a pattern
>>   "begin"[.*]"end"  { yylval.text = \1; return GROUPING; }
>> where \1 would then refer to the matched text between "begin" and "end".

>(The above example is trivial, since you know it starts 5 characters
>from the beginning and ends 3 from the end, but if either is
>variable-length and can also match text in the group, then it gets
>tricky.)

Right: I just wanted to illustrate what I meant, and did not find it
worth the effort to work out a more complicated, non-trivial example,
say one with regular expression parentheses. In addition, I would then
have had to introduce more notation, which also did not seem worth the
effort.
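
In flex notation (which has no \1 capture syntax, so I write the middle
part as .*) the fixed-offset observation above can be written out
directly; a sketch, assuming yylval.text is a char * member of the
usual %union:

  "begin".*"end"   {
          /* no \1 in flex: recover the group by offsets, since the
             delimiters have fixed lengths (5 and 3 bytes)            */
          yytext[yyleng - 3] = '\0';         /* chop trailing "end"   */
          yylval.text = strdup(yytext + 5);  /* copy past "begin"     */
          return GROUPING;
      }

(Two flex caveats: . does not match newline, and .* matches greedily,
so this rule runs to the last "end" on the line.)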

>I don't believe NFAs actually help with this.  To find non-trivial groupings
>you need additional scan-time information.

I got the impression that it would for some reason be simpler to add
that additional information with an NFA: I think that when constructing
a deterministic (finite or pushdown) automaton from the
non-deterministic one, one may merge states in a way that makes it
impossible to distinguish the added information.
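
That intuition can be made concrete: simulating the NFA directly lets
each active state carry its own path information, which is exactly what
a merged DFA state cannot keep apart. A sketch (reusing NSTATES,
StateSet and delta[] from the first fragment; epsilon moves are
omitted, and GROUP_OPEN_STATE is an invented illustrative state):

  #include <string.h>

  #define GROUP_OPEN_STATE 7     /* entering it means "group 1 opens" */

  typedef struct {
      StateSet states;           /* the active NFA states             */
      int      tag[NSTATES];     /* tag[i]: where group 1 started on
                                    the path that reached state i     */
  } Config;

  static void step(Config *cfg, int c, int pos)
  {
      Config next = { 0 };
      memset(next.tag, -1, sizeof next.tag);
      for (int i = 0; i < NSTATES; i++) {
          if (!(cfg->states & ((StateSet)1 << i)))
              continue;
          StateSet t = delta[i][c];
          for (int j = 0; j < NSTATES; j++)
              if (t & ((StateSet)1 << j)) {
                  next.states |= (StateSet)1 << j;
                  if (next.tag[j] < 0)   /* first path to reach j     */
                      next.tag[j] = (j == GROUP_OPEN_STATE)
                                        ? pos : cfg->tag[i];
              }
      }
      *cfg = next;
  }

In the subset construction all the states of one set collapse into a
single DFA state, so the per-state tag[] above has nowhere to live;
that is the information the merging destroys.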

>> >My main current project these days is Bro, a network intrusion detection
>> >system.
>>
>> So it isn't "Big Bro" then. :-)
>
>That's why I chose the name, actually - as a reminder that such tools
>have major privacy implications.

So when you have updated it to add some major features, you can add the "Big".

Strictly speaking, though, I think that George Orwell's "Big Brother"
is not only an authority that monitors you all the time, but one that
does so without itself being checked.

Thus, if you add some features where its monitoring can be independently
checked, it's no longer "Big Bro". :-)

>> It is interesting though that the other fellow wanting the feature was
>> working on an anti-spam tool. So it could happen that these "natural
>> language pattern recognition" tools require grammars that may blow up
>> exponentially in a straightforward DFA translation.
>
>Two points.  First, the main application where this came up for me is
>simply looking for fixed strings (but a whole lot of them) in packets or
>byte streams - not a language problem at all, there's no syntax.  Second,
>this is actually not the essence of good network intrusion detection, IMHO;
>instead, one wants to focus on understanding the structure of the application
>(so that *is* a language problem, but for a formal language, not a natural
>language), and on larger patterns of activity.

I cannot tell from your description whether your program is looking for
human-created strings or computer-generated strings.

But the reason it might show up in the context of human-created strings
is that humans, unlike computers, are excellent at recognizing
different structural patterns. Thus, humans tend to use a lot of
different structures depending on context, which in a computer search
program might show up as complicated grammars. By contrast, when
creating computer-generated strings, one often has the luxury of using
sparser patterns (as it then becomes easier to write programs that read
them), so grammars for parsing those would by choice be simpler.

  Hans Aberg




