help-bison

Re: Non-greedy wildcard possible? (Long)


From: Hans Aberg
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Tue, 18 May 2004 21:35:13 +0200

At 01:01 +0200 2004/05/18, Magnus Lie Hetland wrote:
>  1. Lorem ipsum * dolor sit amet
>
>  Consectetuer *adipiscing* elit. Nulla odio enim, egestas sit amet,
>  congue ut, viverra lacinia, nisl. Fusce laoreet, turpis non mattis
>  pretium, lorem nibh fringilla ipsum, nec tincidunt lorem lorem vel
>  mauris.
>
>  1.1 Integer velit
>
>  Fusce erat libero, convallis ut, rhoncus vel, dapibus
>  quis, libero.
>
>  1. Nam *enim* leo
>
>  1.1 Malesuada quis
>
>  Feugiat non.
>
>Assume, for the sake of argument, that emphasis is not allowed in
>headers. Also assume that numbered lists must have at least two
>members, and that two headers can't occur in sequence (a normal
>typographical requirement). Then the above could (with a proper format
>specification for the current Atox) be parsed...

What is Atox? A dynamic language specification? If so, you might make use
of a dynamic parser, such as an Earley parser. Here is a reference;
perhaps you can get a better one in the newsgroup comp.compilers.
  DINO (simple language with Earley parser)
    http://cocom.sourceforge.net/dinoload.html
  The Earley parser documentation is at
    http://cocom.sourceforge.net/ammunition-13.html
    http://cocom.sourceforge.net/index.html
Some other references that might be relevant:
  http://www.tinycc.org/
  the GNU lightning library http://www.gnu.org/software/lightning/

>... into something like:
>
><doc>
>  <h1>Lorem ipsum 2 * 2 = 4 dolor sit amet</h1>

I am not sure how * is translated into 2 * 2 = 4; typo?

Some suggestions, though:

It seems that you identify blocks by newlines. You might have the lexer
recognize that, so that \n and \n\n+ are different tokens.
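
A minimal sketch of that tokenization, written in Python rather than as a
flex specification so that it can be run directly. The token names
PARA_BREAK, NEWLINE and TEXT are assumptions made for illustration; they
do not come from Atox or from the original message.

```python
import re

# Hypothetical token names. The longer alternative (two or more newlines)
# is listed first so that it wins over the single-newline rule.
TOKEN_RE = re.compile(r"""
    (?P<PARA_BREAK>\n{2,})   # \n\n+ : separates blocks
  | (?P<NEWLINE>\n)          # \n    : line break inside a block
  | (?P<TEXT>[^\n]+)         # any run of non-newline characters
""", re.VERBOSE)

def tokenize(source):
    """Yield (token_name, lexeme) pairs for the whole input."""
    for match in TOKEN_RE.finditer(source):
        yield match.lastgroup, match.group()

tokens = list(tokenize("Lorem ipsum\ndolor\n\nsit amet"))
```

With this split, the grammar can treat PARA_BREAK as a block separator
and NEWLINE as ordinary whitespace inside a paragraph.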

Then, headers and such may require special lexer start conditions (= lexer
contexts), just in order to get around the LALR(1) limitation of Bison.
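
The idea of a start condition can be sketched as follows, again in Python
as a stand-in for a flex specification. It assumes, as in the quoted
example, that a header is a line beginning with a dotted number and that
emphasis is not allowed in headers. The function name lex_line and the
token names HEADER_NUM, STAR and TEXT are hypothetical.

```python
import re

# A header line starts with a dotted number such as "1." or "1.1".
HEADER_START = re.compile(r"\d+(?:\.\d+)*\.?\s+")

def lex_line(line):
    """Tokenize one line; '*' is markup only outside the header condition."""
    match = HEADER_START.match(line)
    if match:
        # "Header" condition: the rest of the line is plain text, so an
        # asterisk never reaches the parser as an emphasis token.
        return [("HEADER_NUM", match.group().strip()),
                ("TEXT", line[match.end():])]
    # "Initial" condition: asterisks are returned as separate tokens.
    tokens = []
    for piece in re.split(r"(\*)", line):
        if piece == "*":
            tokens.append(("STAR", piece))
        elif piece:
            tokens.append(("TEXT", piece))
    return tokens
```

For example, lex_line("1. Lorem ipsum * dolor sit amet") yields a
HEADER_NUM token followed by a single TEXT token, while the same asterisk
in a body line would come back as STAR.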

>The two approaches I've been talking about are:
>
>  1. Return an asterisk token either way, and let the parser sort out
>     whether it should be shifted into a "plain text" production.
>
>  2. Let the lexer know about the legal tokens (such as a double
>     newline in the header) and have return the first occurrence of a
>     legal token.

The choice would depend on whether LALR(1) can cope with the grammar or
not. If 1. works, use it; otherwise fall back to 2.

If you want:
>  1. Nam *enim* leo
>
>  1.1 Malesuada quis
to be translated into a list, and not headers, because you know that
headers have already been exhausted as a possibility, then one way to
resolve that is to set special context variables recording the current
header nesting.

Here is a suggestion: When the lexer starts reading a document, it expects
header "1." to appear, and if it appears, it will return the token "header"
with value 1. If the lexer then sees a "1." appear after the header "1."
has already been seen, the lexer will instead return the token "list" with
value "1.".

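A minimal sketch of such a context variable, assuming the lexer has
already isolated the number at the start of a line. The class name
NumberedLineLexer and its method are hypothetical, and the token value is
kept as the string seen in the input. The sketch only covers the rule
described above: a number that has not been seen before (for example a
list item "2.") would still be reported as a header.

```python
class NumberedLineLexer:
    """Decide between 'header' and 'list' from previously seen numbers."""

    def __init__(self):
        # Context variable: header numbers already returned to the parser.
        self.seen_headers = set()

    def classify(self, number):
        """Return (token_name, value) for a leading number such as '1.'."""
        if number in self.seen_headers:
            # This header number is already used up, so it must be a list.
            return ("list", number)
        self.seen_headers.add(number)
        return ("header", number)

lexer = NumberedLineLexer()
results = [lexer.classify(n) for n in ["1.", "1.1", "1.", "1.1"]]
```

Run on the numbers from the quoted example, the first "1." and "1.1"
come back as headers and the second pair as list items.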
Conditions such as "a list must have at least two elements" can be hard to
catch, or to make practical use of, because the parser may need quite some
lookahead to see them. If you want to make sure that the list numbering is
correct, that can be checked semantically, in the parser actions.
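
As a rough illustration of such a semantic check, the function below is
the kind of test a parser action might run once a whole list has been
reduced. It is a sketch in Python, not Bison action code; the function
name check_list is hypothetical, and it assumes the item numbers have
already been converted to integers and should count up from 1.

```python
def check_list(numbers):
    """Return a list of error messages for the item numbers of one list."""
    errors = []
    if len(numbers) < 2:
        errors.append("a numbered list needs at least two members")
    # Each item should carry the number matching its position in the list.
    for position, number in enumerate(numbers, start=1):
        if number != position:
            errors.append(f"item {position} is numbered {number}")
    return errors
```

Here check_list([1, 2, 3]) returns an empty list, while check_list([1, 3])
reports that item 2 is numbered 3.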

  Hans Aberg





