
From: Laurence Finston
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Mon, 17 May 2004 13:41:30 +0200 (MEST)

On Mon, 17 May 2004, Magnus Lie Hetland wrote:

>
> [snip]
> > I suspect that a rule for "anything else that doesn't fit into the
> > grammar" is very likely to cause lots of conflicts.
>
> Certainly. My hope was to solve that with some form of dynamic
> precedence -- but I can't find any reasonable way of doing that with
> Bison.
>

I don't think "dynamic" is a suitable term with reference to the
grammar rules.  It is possible for `yylex()' to determine
dynamically what tokens to return to `yyparse()', though.
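
A minimal sketch in C of what I mean; the mode flag and token codes
here are hypothetical, just to show the scanner choosing tokens at
run time:

    #include <stdio.h>

    /* Hypothetical token codes; a real Bison grammar would declare
       these with %token. */
    enum { WORD = 258, HEADLINE_WORD };

    /* A flag toggled, say, by a parser action on entering a headline. */
    static int headline_mode = 0;

    /* The scanner consults the flag and returns a different token for
       the same text, depending on the current mode. */
    static int token_for_word (void)
    {
      return headline_mode ? HEADLINE_WORD : WORD;
    }

    int main (void)
    {
      printf ("%d\n", token_for_word ());   /* prints 258 (WORD) */
      headline_mode = 1;
      printf ("%d\n", token_for_word ());   /* prints 259 (HEADLINE_WORD) */
      return 0;
    }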

> [snip]
> > For what it's worth, I finally abandoned the use of Flex and have
> > never regretted it for a moment.
>
> I need to find occurrences of regular expressions in the text (or some
> similar specification of patterns in the text, that can be supplied by
> the user). What do you use? Another regexp library? Or do you use
> grammar rules on single characters?
>
> [snip]
> > I've found that it doesn't really pay to use Flex when one is using
> > Bison, since in this case, the scanner mostly just returns tokens to
> > yyparse().
>
> Sure -- but you still need to scan, right? How do you do that? (Am I
> missing some scanning feature in Bison? Are you suggesting writing
> context-free grammars for the tokens?)
>

`yylex()' is the scanner.  The method I'm using doesn't use regular
expressions---at least, I don't think of it that way.  I suppose it
could be described in terms of regular expressions, though.  I'm
implementing a Metafont-like grammar based on the description in
Backus-Naur form in Don Knuth's _The METAFONTbook_.  On pages 50--51
he describes the scanning rules.  Apart from a few characters that are
handled differently, all characters are divided into categories.  A
`symbolic token' consists of one or more characters from a single
category.  The scanner collects characters into a string representing
a `symbolic token' as long as they belong to a single category.  If it
scans a character from a different category, it stops scanning, and
the token is complete.  For some categories, it
simply sets the semantic value of the token (if appropriate) and
returns the appropriate `int' value, i.e., the token itself, to
`yyparse()'.  For other categories, it looks up the string in a table
to determine what token to return and its semantic value.  If it's not
in the table, it proceeds as in the first case.
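
In case it helps, here is a minimal, stand-alone sketch in C of this
kind of category-based scanning.  The categories, sample input, and
names are my own simplifications for illustration, not Knuth's exact
rules or my actual code:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical token codes; a real Bison parser would generate these. */
    enum { SYMBOLIC_TOKEN = 258, END_OF_INPUT = 0 };

    static const char *input = "draw z1..z2;";  /* sample input */

    /* Assign each character to a category, in the spirit of
       The METAFONTbook, pp. 50--51 (much simplified). */
    static int category (int c)
    {
      if (isalpha (c) || c == '_') return 0;    /* letters */
      if (isdigit (c)) return 1;                /* digits  */
      if (strchr (".:;+-*/<>=", c)) return 2;   /* marks   */
      return -1;                                /* not part of any token */
    }

    /* Collect a maximal run of characters from one category. */
    static int scan_token (char *buf, size_t size)
    {
      while (*input && category ((unsigned char) *input) < 0)
        ++input;                                /* skip separators */
      if (!*input)
        return END_OF_INPUT;
      int cat = category ((unsigned char) *input);
      size_t n = 0;
      while (*input && n + 1 < size
             && category ((unsigned char) *input) == cat)
        buf[n++] = *input++;
      buf[n] = '\0';
      return SYMBOLIC_TOKEN;
    }

    int main (void)
    {
      char buf[64];
      while (scan_token (buf, sizeof buf) != END_OF_INPUT)
        printf ("token: %s\n", buf);   /* draw, z, 1, .., z, 2, ; */
      return 0;
    }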

> [snip]
> > An ambiguous context-free grammar is invalid.
>
> I've never heard any definition of context-free grammars that precludes
> ambiguity. In fact, one of the main features of general parsers that
> can parse any context-free grammars (such as the Earley parser) is
> that they'll find all possible parses of such ambiguous grammars.
>
> Or am I wrong?
>

I'll leave this to the computer scientists on this list to thrash out,
if they care to.  From my point of view, if my grammar is ambiguous,
Bison won't be able to generate a parser, unless its rules for
resolving ambiguity are able to cope with it.  I don't think it would
help me to _find_ all possible parses of an input; I want the input to
actually be parsed, and I want the parser's actions to produce correct
results.
Therefore, my input (in combination with Bison's
default behavior) must unambiguously specify a single set of rules for
parsing.

> > I'm not familiar with Python, so I don't know how its non-greedy
> > closure operator works.
>
> It's not particular to Python. It's used in Perl too -- and I would
> suspect in many other places as well. It's very easy; for example, the
> pattern .*?y (a non-greedy repetition of a wildcard followed by a
> 'y') would match the first "xy" in the string "xyxy", while the greedy
> version, .*y, would match the entire string.
>
> Very simple, efficient, and useful. I'm basically looking for an
> equivalent in context-free parsing.
>

If the terminator is a single character then the problem is fairly
simple.  The non-greedy version above is equivalent to `[^y]*y' in
Emacs' regexp syntax (the one I'm most familiar with).  However, it
would be difficult and maybe impossible to specify something like
`.*^(abc)', meaning "any string not ending in `abc'".
Maybe this is easy and efficient in Python; I don't know.  I believe
that any features requiring backtracking are likely to be slow, though.
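
For what it's worth, the equivalence is easy to check with POSIX
regular expressions, which need no backtracking for this case.  A
small, self-contained demonstration in C, separate from any parser:

    #include <regex.h>
    #include <stdio.h>

    /* Show that [^y]*y matches only the first "xy" in "xyxy",
       while the greedy .*y matches the whole string. */
    static void try (const char *pattern, const char *text)
    {
      regex_t re;
      regmatch_t m;
      if (regcomp (&re, pattern, REG_EXTENDED) != 0)
        return;
      if (regexec (&re, text, 1, &m, 0) == 0)
        printf ("%-8s matched \"%.*s\"\n", pattern,
                (int) (m.rm_eo - m.rm_so), text + m.rm_so);
      regfree (&re);
    }

    int main (void)
    {
      try (".*y",    "xyxy");   /* greedy: matches "xyxy"            */
      try ("[^y]*y", "xyxy");   /* non-greedy equivalent: matches "xy" */
      return 0;
    }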

>
> > at worst, one lookup for every character scanned.
>
> That would require a very inefficient implementation indeed.
>

I think this depends on what characters can occur in input items.
(I say "input items" to distinguish them from the "tokens"
returned to `yyparse()' by `yylex()'.)
If there's no easy way to recognize separations, e.g., if
blank space is allowed within input items, and if input items
can be prefixes of other input items, then this might be
necessary.
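
To illustrate the difficulty, suppose the table of input items
contains entries like the following (an invented example), where one
item contains a blank and is an extension of another.  The scanner
cannot commit after reading `for'; in the worst case it must consult
the table after every character it scans:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical table of input items; "for" is a prefix of
       "for ever", and blank space is allowed within an item. */
    static const char *items[] = { "for", "for ever", "end" };

    /* Longest-match lookup: find the longest item that is a prefix
       of the remaining input. */
    static int longest_match (const char *s)
    {
      int best = -1;
      size_t best_len = 0;
      for (size_t i = 0; i < sizeof items / sizeof items[0]; ++i)
        {
          size_t len = strlen (items[i]);
          if (len > best_len && strncmp (s, items[i], len) == 0)
            {
              best = (int) i;
              best_len = len;
            }
        }
      return best;
    }

    int main (void)
    {
      int i = longest_match ("for ever end");
      if (i >= 0)
        printf ("matched item: \"%s\"\n", items[i]);  /* "for ever" */
      return 0;
    }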

> > I think it might be easier to scan the entire output once and insert
> > markers at appropriate places rather than to try to handle this
> > problem on the fly.
>
> But, as I've tried to state (repeatedly), this won't work. It can't
> work. For example, a headline may not have a sub-production allowing
> for emphasis, so if I find an asterisk in it, that must actually be an
> asterisk, and not an emphasis marker (as it would be in a normal
> paragraph). This could not possibly be determined by a simple scanner,
> because the headline-status of a piece of the document depends on a
> full parse.
>
> This is the crux of the discussion; if it had been as simple as you
> suggest (simply scanning through the document, looking for proper
> tokens) I'd have the problem solved long ago, and wouldn't have to ask
> here :)
>
> (It may be, however, that I'm misunderstanding your suggestion, of
> course.)

Theoretically, you could insert the markers and decide on the second
pass how to handle the asterisk.

>
> [snip]
> > What's the difference between the last exclamation point and the first two?
>
> The fact that the last one comes last -- and that is the point.
>
> > If there is one, it must be specified in the rules.  If not, then it
> > won't be possible for the scanner to break this string into the two
> > tokens "Foo! Bar! Baz" and "!".
>
> And it certainly shouldn't -- which has been my point all along. As
> long as the properties involved are context-free, the scanner won't be
> *able* to. (This wasn't the best example of that, though :)
>
> Thus, it would be up to the parser to join the various "Foo" and "!"
> tokens of the first part into a single production.

In what sense is it "last"? Because only whitespace follows it up
to the next newline?  If `yyparse()' is to recognize it as different
from the other two, then `yylex()' must return a different token to
represent it.
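
Sketched in C, here is one way `yylex()' could make that distinction
by peeking ahead; the token names are invented for this example:

    #include <stdio.h>

    /* Hypothetical token codes for the two kinds of exclamation point. */
    enum { BANG = 258, FINAL_BANG };

    /* When the scanner sees '!', it peeks at the rest of the line; if
       only blanks remain before the newline (or end of input), it
       returns a distinct token for yyparse(). */
    static int classify_bang (const char *rest)
    {
      while (*rest == ' ' || *rest == '\t')
        ++rest;
      return (*rest == '\n' || *rest == '\0') ? FINAL_BANG : BANG;
    }

    int main (void)
    {
      printf ("%s\n", classify_bang (" Bar! Baz!") == FINAL_BANG
                      ? "FINAL_BANG" : "BANG");   /* BANG       */
      printf ("%s\n", classify_bang ("   \n") == FINAL_BANG
                      ? "FINAL_BANG" : "BANG");   /* FINAL_BANG */
      return 0;
    }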

>
> [snip]
> > I think markup is by definition explicit.
>
> Well, feel free to give what I'm doing another name, then. I would
> have thought the term "implicit markup" would be clear enough. The
> point is that I'm trying to elicit document structure from plain-text
> documents and add explicit markup (in the form of XML).
>
> (Or -- I'm not just trying; I'm already doing it. I'm just trying to
> get Bison on my team ;)
>

> > I also think that it might be easier just to mark up a document
> > rather than try to define rules for parsing a sparsely marked-up
> > file.
>
> It would certainly be easier on the parser. I, however, would like to
> make the job easier for the author.
>
> I've been using markup formats like LaTeX and XML (and other ad hoc
> formats) quite a lot. They're fine, but there is (IMO) a need for
> simpler, sparser, more plain-text-like markup. Many have argued for
> this, and many specific implementations exist. My idea was to create a
> generic engine for this sort of thing, instead of simply inventing my
> own format, like everyone else.
>
> (More about this, and references to lots of specific formats, in the
> docs at atox.sf.net.)
>

To quote Don Knuth, "Computers are good at following instructions;
they're not good at reading your mind".

I too find LaTeX's style of markup too verbose and rigid.  I haven't
used XML, but I feel the same way about HTML.  I like TeX's style of
markup, and it's the markup language I'm most familiar with.  TeX is
also the reason why I suggested using a macro processor.  In general,
I find that it pays to look at the way Knuth solves a problem.

Laurence




