help-bison
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Non-greedy wildcard possible? (Long)


From: Magnus Lie Hetland
Subject: Non-greedy wildcard possible? (Long)
Date: Sat, 15 May 2004 17:41:55 +0200
User-agent: Mutt/1.4i

Hi!

I'm working on a project (atox.sf.net) where I want to use Bison to
add markup to plain text. I need a rule to represent the plain text
segments (between the markup tokens) and that's proving a challenge.

I have already solved the lexing part, so the entire text is segmented
into tokens, with a special "between"-token for any text that isn't a
markup token. The problem is that in some context, something like a
newline might be a significant part of the markup, while in other
contexts it may not be.

Basically, what I need is some form of non-greedy wildcard rule --
something that will match any token up to the first occurrence of a
legal rule.

The simple (but possibly messy, implementation-wise) solution would be
to analyse the grammar and create specific wildcard rules for each
plain text segment, allowing only the tokens that can't start any
legal rules. This would be sort of an LL(1) way if doing it, and it
would require some possibly quite hairy grammar analysis
(especially with empty rules). One possibility (if I could figure out
how to do it) would be to use the LR table in Bison to check which
tokens could start a rule in any given context.

*However* -- what I'm *hoping* to be able to do is this: Use a GLR
parser, define a generic wildcard token that will match any token,
somehow give it a low priority (it doesn't seem like either normal
precedence or %prec works quite the way I need it to) so that the
parser will know when to terminate the wildcard.

Let's say, for example, that asterisks are used to indicate emphasis
Let's say the following is the input:

  Fee fie *foe* fum.

The grammar would match a series of chunks (either code or emphasis)
interspersed with wildcards. When it hits the first asterisk, it
wouldn't know what to do (it could just be a lone asterisk -- a second
one would be needed to make an emphasis chunk). So the GLR comes into
play. One parser keeps shifting, while one reduces the wildcard before
shifting the asterisk.

It *seems* to me that this approach might work, but I can't get it
quite right. I've had some success with some simple examples, but once
I start writing bigger grammars I get lots of ambiguity, conflicts and
even run out of parser stack (even though I have left recursion).

I can understand the ambiguity (how would the parser know which one to
use) -- is there any way to resolve that? (Can %dprec be used here?)
Is this approach possible at all, when I can't assume much about the
grammar? (It will be user-supplied...)

Here is a simple example I've been toying with (after my more complex
tests started falling apart ;) -- with the actions removed for
readability:

%%

document : fill chunks SEMI
         ;

chunks   : /* empty */
         | chunks emph fill
         ;

emph     : STAR fill STAR
         ;

fill     : /* empty */
         | fill token
         ;

token    : FOO | BAR | SPACE | STAR
         ;

%%

This is a sample input:

  foo bar foo *foo bar* foo

Why doesn't this work? Is there any way of making it work? Any other
general ways to approach this problem? Any way to determine the
possible lookaheads in a given situation (for the LL(1) wildcard; i.e.
in this case I'd be looking for the star)? (I've seen some postings
about the latter for tab completion, but it seems a bit
complicated...)

Sorry that this posting was a bit long. If anyone has the patience to
read through it and give me some pointers I'd be very grateful.

Thanks,

- M

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park




reply via email to

[Prev in Thread] Current Thread [Next in Thread]