help-bison
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Non-greedy wildcard possible? (Long)


From: Magnus Lie Hetland
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Wed, 19 May 2004 03:38:48 +0200
User-agent: Mutt/1.4i

Frank Heckenbach <address@hidden>:
>
[snip]
> > This has to be inferred from the fact that the only legal token at
> > that point would be a closing quote -- everything else should be
> > shifted into the verbatim/plain text production.
> 
> Sure. You'll have to do it some way. Set-of-all-tokens -
> set-of-terminating-tokens-for-this-rule -
> set-of-tokens-starting-markup-allowed-here =
> set-of-tokens-to-be-treated-as-plain-text-here. Should be rather
> automatic.

Hm. I suppose so. I was worried that some rules might give me trouble;
for example, if a rule ends with a wildcard, finding out which
terminating token is in effect would require knowledge about the parse
state which cannot be inferred statically (at least as far as I can
see), because the terminating token belongs to another production.

But as long as I stay away from rules that end with wildcards, I guess
the analysis is quite straightforward. (Are there other pitfalls?) And
this was, in fact, the approach I considered earlier.

> (Where I'd do this for the parser(-generation), not the
> scanner(-generation), so the scanner remains static, see below.)

Yes, I agree. This was, indeed, how I planned to do it (a while back).

> > > It gets worse when you use GLR: As long as the parser is split, it
> > > doesn't execute any actions, but of course consumes tokens.
> > 
> > Ah. And I was planning on using GLR, to put less of a burden on the
> > user.
> 
> I'm not advising against it (in fact, I use GLR myself). But to some
> extent it's "all or nothing". GLR doesn't mix well with
> context-dependent lexing,

Right -- that was what I was reacting to above. But as long as I can
reliably analyze the grammar and produce CFG rules for my wildcards,
the parse should be quite clean, and suitable for GLR.

[snip]
> For scanning, that's simple so far: tokens are "*", "'" and [^*']*
> (the last one is your non-greedy .*, relative to the other tokens).

Sure -- I've gotten Flex to do that bit. I have a special token to
represent "anything not matched by one of the explicit tokens", so I
basically have the entire input string carved into tokens.

> > > This example was especially difficult because context-dependent
> > > tokens could overlap. If you have only single characters or so, it
> > > will be easier.
> > 
> > I don't, sadly. I have things like lines of dashes underlining
> > headers, as well as emdashes written "--", in the samples. In general,
> > the user may specify anything.
> 
> Also in the same grammar? So it could be either a header or a word
> and an emdash?

That would be up to the user -- who would quite likely also have to
specify some way of breaking the ambiguity. My current parser is an
LL(1)-like parser that tries things top-to-bottom, making things
defined earlier in the grammar trump those defined later.

Just a couple of thoughts at the end:

  1. Adding some domain-specific assumptions might make the system
     easier to use, and more efficient. For example, I could use a
     two-tiered parsing strategy, one working on the level of blocks
     (paragraphs, headings, list items etc.) separated by "vertical
     whitespace" (with the kinds of blocks being defined by various
     properties such as "starts with a numeric label" or "ends with a
     line of dashes"), and the other working with inlines inside the
     blocks, with different grammars for each block type.

  2. I just discovered the GLR* algorithm by Lavie and Tomita -- a
     generalization of GLR that allows the parser to skip unparsable
     parts of the input, and return the parse with the least number of
     skipped tokens. This is *dangerously* close to mind-reading, and
     might work in my case -- at least, it's very cool ;)

More info on GLR* here: http://citeseer.ist.psu.edu/lavie93glr.html

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park




reply via email to

[Prev in Thread] Current Thread [Next in Thread]