From: Frank Heckenbach
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Wed, 19 May 2004 03:13:02 +0200
User-agent: semail 20040101

Magnus Lie Hetland wrote:

> Frank Heckenbach <address@hidden>:
> >
> > Magnus Lie Hetland wrote:
> > 
> > > Just an example to show why a fixed set of markup tokens (that will
> > > end the plain text) won't do:
> > > 
> > >   Foo *bar 'baz *fee* fie' foe*.
> [snip]
> > In this case, I'd just make the asterisk a different token
> > (context-independent) which is parsed as normal text inside the
> > verbatim rule and as markup otherwise. (This is still CFG, of
> > course.)
> 
> *Exactly!* That's what I've been saying all along.
> 
> The problem has then been specifying the grammar of such runs of
> "plain text", i.e., how they are terminated. The user specifies only
> the non-plain-text part of the grammar -- in this case, the quotes and
> asterisks, and how they are composed into spans of different types.
> However, the user does *not* specify that the asterisk is to be
> treated as plain text between the quotes. This has to be inferred from
> the fact that the only legal token at that point would be a closing
> quote -- everything else should be shifted into the verbatim/plain
> text production.

Sure. You'll have to do it some way:

  set-of-all-tokens
  - set-of-terminating-tokens-for-this-rule
  - set-of-tokens-starting-markup-allowed-here
  = set-of-tokens-to-be-treated-as-plain-text-here

Should be rather automatic. (Where I'd do this for the
parser(-generation), not the scanner(-generation), so the scanner
remains static; see below.)
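
For instance, with the tokens from the emphasis/verbatim example
below ('*', the quote, and a TEXT token for everything else):
inside the verbatim rule the only terminating token is the closing
quote, and no markup may start there, so

  {'*', '\'', TEXT} - {'\''} - {} = {'*', TEXT}

i.e. the asterisk is treated as plain text at that point.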

> > It gets worse when you use GLR: As long as the parser is split, it
> > doesn't execute any actions, but of course consumes tokens.
> 
> Ah. And I was planning on using GLR, to put less of a burden on the
> user.

I'm not advising against it (in fact, I use GLR myself). But to some
extent it's "all or nothing". GLR doesn't mix well with
context-dependent lexing, OTOH some tricks for which I used
context-dependent lexing before have become superfluous with GLR.

> Consider the following format (described here in natural language)
> which might be specified in Atox:
> 
>   - Emphasis begins with an asterisk and ends with an asterisk.
>     Emphasis may contain verbatim text.
> 
>   - Verbatim text begins with a single quote and ends with a single
>     quote.

For scanning, that's simple so far: tokens are "*", "'" and [^*']*
(the last one is your non-greedy .*, relative to the other tokens).
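
In flex notation, a minimal sketch of such a scanner (TEXT is an
invented token name; I use + rather than * so the rule can't match
the empty string):

%%
"*"     return '*';
"'"     return '\'';
[^*']+  return TEXT;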

> > This example was especially difficult because context-dependent
> > tokens could overlap. If you have only single characters or so, it
> > will be easier.
> 
> I don't, sadly. I have things like lines of dashes underlining
> headers, as well as emdashes written "--", in the samples. In general,
> the user may specify anything.

Also in the same grammar? So it could be either a header or a word
and an emdash?

  it
  --

If tokens can be anything and can overlap (say "<--" and "-->" can
be tokens) and you have no further static information about the
contexts where they're used (unlike, say, comment delimiters in
HTML), then according to my approach you might have to make each
character that is part of a (real) token a pseudo token, i.e.
when the lexer sees "<-->" it returns '<', '-', '-', '>'. Then the
parser has rules like:

arrow_left: '<' '-' '-';

arrow_right: '-' '-' '>';

Of course, one could do a little better in some cases by analyzing
the tokens (here: make "--" a single token).
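
E.g. (DASHDASH being an invented token name), a flex rule

"--"    return DASHDASH;

would shorten the rules above to:

arrow_left: '<' DASHDASH;

arrow_right: DASHDASH '>';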

But basically, this way shifts most of the burden to the parser. The
lexer will then become quite simple. This is basically
"scannerless parsing" as described in
<http://www.cs.uu.nl/people/visser/ftp/BSVV02.pdf>.

As noted there, this usually requires a full CFG parser such as GLR
(Tomita). The disambiguation rules described in the paper are not
available in bison as such, but in a discussion I had with Hans
Aberg we came to the conclusion (though without formal proof) that
it should be possible to rewrite them into static precedences and/or
extra nonterminals.

Of course I should add that I haven't done this in such an extreme
form. I have a few, and statically known, critical tokens. I don't
know if it will be very inefficient to have one token per character.
OTOH, in most practical grammars, probably letters etc. will not be
special tokens, or at least only in some combinations (keywords), so
in the common case, words or even sentences can become single tokens
(with a little more effort in the scanner again), and the
inefficiencies, if any, might not be practically relevant.
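
As a sketch of that (WORD invented, and assuming letters never
start a markup token in the grammar at hand):

%%
[A-Za-z]+  { return WORD; /* a whole word as one token */ }
.|\n       { return yytext[0]; /* one pseudo token per char */ }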

> [snip]
> > OK, here we have a more complicated example. This may be helpful to
> > get to the root of the problem. IIUIC, this is a line with emphasis
> > and a second (strange) line:
> > 
> > Foo *bar*
> > ---+-----
> > 
> > And this is a heading without emphasis and two asterisks in the
> > text:
> > 
> > Foo *bar*
> > ---------
> 
> Yes -- assuming that this is how the user has set it up. (The format
> is 100% user-specifiable -- that's really the reason for all these
> difficulties. Hard-coding a format as a grammar would be easy.)
> 
> > If the heading and normal line require substantially different
> > parsing (as the emphasis does already -- if it's only this, it could
> > be kludged around, but I suppose there's more and you really need to
> > parse it differently), I see two ways of doing it: 1. with
> > look-ahead in the lexer (quite hairy, similar to content-dependent
> > lexing, not recommended), 2. using GLR parsing.
> 
> Yes, that was what I was hoping for -- but I haven't been able to make
> it work.

Have you tried it with GLR? So far it seems like a nice nontrivial
example that should work well (without dprec).
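
(Enabling it is a single declaration in the bison file:

%glr-parser

plus, if you like, %expect-rr to declare the expected number of
reduce/reduce conflicts.)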

> > If you make sure that the language is actually unique (e.g., your
> > lexer gives a special token for `---' lines which is only parsed in
> > headings),
> 
> Hm. I can't really guarantee that. For example, the following might be
> a legal table:
> 
>   ----------------------------
>   Foo   Bar    Baz   Frozzbozz
>   ============================
>   1     2      3     4
>   5     6      7     8
>   ----------------------------

OK, so it also occurs there. The point is that sooner or later the
parsing becomes unique again. Otherwise the language is really
ambiguous. This might happen on silly user input, but then you can
just pass on the GLR error message, because it is in fact a user
error.

> > then you might not need dprec etc. The scanner will fork, one branch
> > will run into a parsing error sooner or later (because one branch
> > will not accept the `---' token and the other one will require it),
> > and the other branch will continue. This is the "most harmless" way
> > of using GLR.
> 
> I guess. But how would I specify the rule leading up to the
> underlining? The way I've been trying to do it, I've specified it as a
> rule matching anything at all -- and if I do that, I need for the more
> specific rule to trump it somehow.
>
> On the other hand, I could simply say that it consists of all tokens
> except a run of dashes (for example) -- but how am I to find that out
> from the parser state, and how am I to signal it to Bison, without
> statically rewriting the grammar? (I could do that, I suppose, but I'm
> worried that it would be overly strict, and miss some legal parses.)

I'd put this in the grammar. Roughly (and untested):

main: plain | heading;

plain: | plain p_item;

p_item: item | emph;

emph: '*' items '*';

items: | items item;

heading: h_items dashline;

h_items: | h_items h_item;

h_item: item | '*';

item: other_character;
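
(Here dashline and other_character are assumed to be terminals
delivered by the scanner, e.g.

%token other_character dashline

with a flex rule along the lines of ^-+\n returning dashline; the
names are placeholders, of course.)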

So the parser will fork, one branch parsing plain (so handling
emph), the other one parsing heading (treating '*' as a normal
character).

Now if it's in fact a heading, the first branch will choke on the
dashline, because there's no way for it to accept it. If it's not a
heading, the second branch will fail sooner or later (say, at the
next paragraph, if item doesn't include paragraphs) because it
requires the dashline.

Then bison will do all the actions from the surviving branch and all
is fine (as if bison had magically guessed the right branch in the
beginning).

Alternatively, you might be able to write a shorter grammar using
dprec (for the dashline rules). But since I've never used (needed)
dprec myself so far, I won't say more about it. You might want to
experiment ...
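
Just to show the shape it would take (untested):

main: plain %dprec 1
    | heading %dprec 2
    ;

so that when both branches survive as a genuine ambiguity, the
heading reading is preferred.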

> > > OK. Maybe so. But finding the first occurrence of a disjunction of
> > > strings (or patterns) can be done in linear time; once you have the
> > > first occurrence, you also know the length of the span that *didn't*
> > > match any of the patterns.
> > 
> > This might be a viable solution. AFAIK, flex doesn't support it,
> 
> I'm not sure what you mean. This is exactly what Flex does -- it finds
> the first occurrence of one of its tokens.

Not really. It has to match an "input item" (whether it becomes a
token for the parser depends on the flex action, of course) at the
current position. flex has no concept of "garbage" between tokens.

What you could do is make a flex rule `.' at the very end. Due to
flex's longest-match rule, and its first-rule-wins tie-breaking for
equal-length matches, this will only match if no real token
applies.
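
I.e., something like (OTHER being an invented token name):

%%
"*"     return '*';
"'"     return '\'';
.       { return OTHER; /* fires only if no real token matches */ }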

But it will be a single character only. Of course, for efficiency
(see above), you could assemble a sequence of those items, say in a
function sitting between the flex scanner and yyparse. (Whereas `.*'
as the last rule won't work, since it will match a whole line which
will often be longer than real tokens and thus take precedence.)
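
A sketch of such a glue function in C (OTHER and TEXT as above;
assume the flex-generated scanner has been renamed to real_yylex,
e.g. via flex's %option prefix, and a real version would of course
also collect the matched text into yylval):

extern int real_yylex(void);   /* the flex-generated scanner */

static int lookahead = 0;      /* one buffered token, 0 = none */

int yylex(void)                /* what yyparse actually calls */
{
    int tok = lookahead ? lookahead : real_yylex();
    lookahead = 0;
    if (tok != OTHER)
        return tok;            /* a real token, pass it through */
    /* glue the following run of OTHER characters into one item */
    while ((lookahead = real_yylex()) == OTHER)
        ;                      /* append yytext to a buffer here */
    return TEXT;
}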

So in this sense, the answer to the original subject is that `.' as
the last flex rule is non-greedy.

> As long as I know which tokens are legal, Flex's functionality works
> just fine (if I go for the LL(1)-style plain text rule). The problem
> is finding out which are legal in a given state (which it seems is
> possible) -- and dealing with what you mentioned about Bison fetching
> several lookahead tokens, which would foul things up.

As I made clear, I'm no fan of context-dependence in the scanner. So
the scanner (IMHO) should scan to the lowest common denominator
(which may be as bad as fragmented tokens, see above) and leave the
rest to the parser (which is context-dependent anyway and has more
powerful mechanisms available, though it will probably be somewhat
slower).

> Oh, well. I guess I could always use Flex for tokenization, and adapt
> my current LL(1)-like parsing code to run on top of it -- that still
> ought to speed up things a bit from the current, even if I don't get
> the full speed of Bison.

BTW, I wonder how you handle cases such as the heading above in
LL(1).

Finally, please note that these are just assorted suggestions. I
haven't done such an abstract thing yet, and it's a bit hard to work
with these meta-examples (which, of course, arise naturally in a
meta-language, but it's still hard), so you'll have to find out what
will work and what won't.

Frank

-- 
Frank Heckenbach, address@hidden
http://fjf.gnu.de/
GnuPG and PGP keys: http://fjf.gnu.de/plan (7977168E)



