help-bison

Re: Non-greedy wildcard possible? (Long)


From: Laurence Finston
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Wed, 19 May 2004 11:45:10 +0200 (MEST)

On Wed, 19 May 2004, Frank Heckenbach wrote:

> Laurence Finston wrote:

>
> > I agree that regexps can be handy. I just don't need them for the
> > scanner and parser that I'm working on now.
> > As far as Magnus' problem is concerned, I've tried to explain why I
> > think his approach won't work.  In addition, the Flex manual states
> > that the use of trailing context in the regexps will slow the scanner
> > down considerably.
>
> Only variable trailing context, which means that the length of both
> the regex and the context is variable.
>
> > There is no way around this; it is implicit in the nature of regular
> > expression matching.
>
> Perhaps you got confused by the flex terminology. Trailing context
> means context given with `/foo', and only variable-length context is
> dangerous. This is in no way implicit, and can usually be avoided
> (I've always been able to avoid it so far).

I just checked the Flex manual.  At the beginning of Chapter 17,
"Performance Considerations", on page 45, it lists "arbitrary trailing
context" and "pattern sets that require backing up" as the second- and
third-most expensive "options/actions" that degrade performance; the
"beginning of line operator" is the eighth-most expensive.  What I was
thinking of was the problem of matching a regexp like
".*^(`a' || `bb' || `ccc' ...)", standing for "any string not ending in
`a', `bb', `ccc', etc."  This seems to be what Magnus wants, where the
list "a, bb, ccc" is specified by the user and can vary depending on
context.  This will be expensive no matter what, not just because of
the backing up, but because of all the table lookups.
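For the sake of concreteness, here is a rough Python sketch of that kind
of pattern (not flex, and not Magnus' actual rules; the suffix list
`a', `bb', `ccc' is just the example above): the user-supplied list is
compiled into a single alternation inside a negative lookahead.

```python
import re

def not_ending_in(suffixes):
    """Compile a pattern matching any string that does NOT end in one
    of the given suffixes -- a negative lookahead over the whole input,
    analogous to the ".*^(...)" notation above."""
    alt = "|".join(re.escape(s) for s in suffixes)
    return re.compile(r"(?s)(?!.*(?:%s)$).*" % alt)

pat = not_ending_in(["a", "bb", "ccc"])
print(bool(pat.fullmatch("hello")))   # True: ends in none of them
print(bool(pat.fullmatch("banana")))  # False: ends in `a'
print(bool(pat.fullmatch("xbb")))     # False: ends in `bb'
```

Note that even this backtracking engine has to walk the whole string
inside the lookahead before it can accept or reject, which is the sort
of repeated work being discussed.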

>
> (Flex matches the regex and the context at once, concatenated, and
> then has to find the start of the context, i.e. the end of the
> actual pattern. If either length is constant, it can easily find it
> by counting characters from the beginning or ending. Only if both
> are variable-length, it has to do more complex matching.)

No matter what, it will still have to compare pairs of characters, and
if it has to back up, it will compare one or more characters from the
input more than once.  Depending on how much pattern matching must be
done, how complex the patterns are, and the length of the input, this
could lead to a serious degradation of performance, and in my opinion
probably will.
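The "backing up" cost can be made concrete with a toy maximal-munch
scanner in Python (a sketch, not flex's table-driven implementation;
the token set {"ab", "abcd"} is an invented example): when the longer
candidate fails, the characters read past the last full match have to
be examined again.

```python
def longest_match_scan(tokens, text):
    """Toy maximal-munch scanner.  At each position it extends the
    match as far as any token allows, remembering the last complete
    token seen; when no token can extend the prefix, it backs up to
    that last match, so the characters after it get re-read.
    Returns (tokens_found, total_chars_examined)."""
    out, i, examined = [], 0, 0
    while i < len(text):
        best, j = None, i
        while j < len(text):
            examined += 1
            candidate = text[i:j + 1]
            if candidate in tokens:
                best = candidate          # last accepting position
            if not any(t.startswith(candidate) for t in tokens):
                break                     # dead end: time to back up
            j += 1
        if best is None:
            out.append(text[i])           # emit unknown char as-is
            i += 1
        else:
            out.append(best)
            i += len(best)                # chars past `best' are re-read
    return out, examined

print(longest_match_scan(["ab", "abcd"], "abce"))
# → (['ab', 'c', 'e'], 6): six reads for four characters,
#   because `c' and `e' are each examined twice
```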

> > If my input was `point p;' and my rule was
> > `<type> <variable> <semi-colon>', the scanner returned
> > `<type>' and `<variable>', but the semi-colon was lost.  If my input was
> > `point p ;', then it wasn't.  The first input also worked if the rule
> > was changed to `<type> <variable_with_semi-colon>'.  I fiddled quite a bit
> > with the regexps and I couldn't get it to work.
>
> Well, I don't know what's going on there, but normally this should
> work with any reasonable version of (f)lex.

I thought it should, too.
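For what it's worth, the expected behaviour is easy to reproduce with a
rough Python analogue of such token rules (hypothetical patterns, not
the original (f)lex spec): with an identifier rule and a separate rule
for `;', the semi-colon comes out as its own token whether or not it is
preceded by a space.

```python
import re

# Hypothetical analogue of the scanner rules: identifiers, plus the
# semi-colon as a token in its own right; whitespace falls through.
TOKEN = re.compile(r"[A-Za-z_]\w*|;")

def tokenize(src):
    """Return the token strings found in `src', skipping whitespace."""
    return TOKEN.findall(src)

print(tokenize("point p;"))   # ['point', 'p', ';']
print(tokenize("point p ;"))  # ['point', 'p', ';']
```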

>
> > I would also have thought that ambiguity in the rules would imply
> > that there could be more than one path through the rules for one or
> > more inputs.  If the result of parsing can be said to be a "sentence"
> > in the language defined by the syntactic rules and the semantic
> > actions, then the choice of path through the rules would affect both
> > the syntax and the semantics of the "sentence".
>
> Especially the semantics.

Why "especially"?

> This book is quite nice: Dick Grune et al.: "Parsing Techniques, A
> Practical Guide" <http://www.cs.vu.nl/pub/dick/PTAPG/>.
>

Thanks for the tip.

Laurence



