help-bison
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Non-greedy wildcard possible? (Long)


From: Magnus Lie Hetland
Subject: Re: Non-greedy wildcard possible? (Long)
Date: Tue, 18 May 2004 01:01:35 +0200
User-agent: Mutt/1.4i

Hans Aberg <address@hidden>:
[snip]
> >> Normally, this is done by the lexer, sending back say a token
> >> "text". Why is you not doing it that way?
> >
> >Because what tokens are significant (as opposed to "plain text")
> >is described by a user-specified context-free grammar.
>
> Normally this is handled by a look-up table that is accessed by the
> lexer, also in the case when the grammar is not formally
> context-free. By tweaking the lexer and the parser together, one can
> often handle languages that are not formally context-free.

Sounds great, but I don't need to handle languages that aren't
context-free, really. (At least I don't think that's the gist of my
problem.)

I don't mind using a lookup table (e.g. one telling me which tokens
are legal in the current parsing state, so I can skip all text up to
the first occurrence -- that's what I'm probably going to go for) but
that table would have to be generated somehow. I cannot expect the
user to supply this, so I'd have to get it from the grammar; I could
get Bison to output info about its states etc. as a text file, but it
seems to me that it would be better to access the current legal
lookahead tokens during the parsing.

> So if you say you "already have solved the lexer part", that is a
> suspicious remark, because the lexer will in practise depend on the
> limitations of the parser.

Not necessarily. (I have repeatedly explained my position on this in
previous emails; the idea is basically to have the lexer return the
"lowest common denominator" pieces of text, and have the parser put
them together.)

But it does seem like an approach where the lexer gets knowledge about
the current parsing state (i.e. which tokens are legal) is what I want
to go for -- that is also the most similar to my current
implementation in Atox (where, indeed, the first occurrence of a legal
token is used).

> I think you need to describe a little better your language problem.
> Examples? Then, one can determine what techniques you might need.

OK -- I think I basically have my solution (one based on the code I
found in the archives, for discovering the current legal lookahead)
tokens, but here is an example (somewhat contrived, for
brevity/simplicity):
  
  1. Lorem ipsum * dolor sit amet

  Consectetuer *adipiscing* elit. Nulla odio enim, egestas sit amet,
  congue ut, viverra lacinia, nisl. Fusce laoreet, turpis non mattis
  pretium, lorem nibh fringilla ipsum, nec tincidunt lorem lorem vel
  mauris.

  1.1 Integer velit
  
  Fusce erat libero, convallis ut, rhoncus vel, dapibus
  quis, libero.
  
  1. Nam *enim* leo
  
  1.1 Malesuada quis
  
  Feugiat non.

Assume, for the sake of argument, that emphasis is not allowed in
headers. Also assume that numbered lists must have at least two
members, and that two headers can't occur in sequence (a normal
typographical requirement). Then the above could (with a proper format
specification for the current Atox) be parsed into something like:

<doc>
  <h1>Lorem ipsum 2 * 2 = 4 dolor sit amet</h1>

  <p>Consectetuer <em>adipiscing</em> elit. Nulla odio enim, egestas
  sit amet, congue ut, viverra lacinia, nisl. Fusce laoreet, turpis
  non mattis pretium, lorem nibh fringilla ipsum, nec tincidunt lorem
  lorem vel mauris.</p>

  <h2>Integer velit</h2>
  
  <p>Fusce erat libero, convallis ut, rhoncus vel, dapibus quis,
  libero.</p>
  
  <ol>
    <li>Nam <em>enim</em> leo</li>
    <ol>
      <li>Malesuada quis</li>
    </ol>
  </ol>
  
  <p>Feugiat non.</p>
</doc>

Here it is clear that more than simply linear left-to-right scanning
is required to determine what something is -- such as header vs. list
item (as well as grouping list items into lists). This can be done in
a very natural manner with normal parsing techniques, rather than ad
hoc methods (I've tried a few ad hoc rule-based methods too). Now,
whether a piece of text is a header or a list item has consequences
for what "plain text" is (and for the "wildcard" I've been talking
about -- what to skip before the next significant token). In a header,
an asterisk is just an asterisk, but in paragraphs and list items, it
is used as a marker of emphasis. Thus, in one it is skipped (and
treated as plain text) while in the other, it is used as a significant
token.

The two approaches I've been talking about are:

  1. Return an asterisk token either way, and let the parser sort out
     whether it should be shifted into a "plain text" production.

  2. Let the lexer know about the legal tokens (such as a double
     newline in the header) and have return the first occurrence of a
     legal token.

The first of these lets the lexer "play dumb" but would require either
a production for the wildcard that excludes the significant tokens;
writing this statically would probably result in one that is too
strict.

The second (which is the approach I'll probably use, and which is
virtually identical with the current one I've implemented in Python)
requires me to fetch a list of legal tokens each time yylex is called.
I believe I have found code to do that [1].

I'm sorry if I'm not being clear. I've thought a lot about this, so
I'm sure there are lots of hidden assumptions in my descriptions.
Also, this isn't exactly plain vanilla parsing -- I'm trying to do
something I haven't seen done before, and that's always a challenge :]

[1] http://mail.gnu.org/archive/html/help-bison/2002-10/msg00057.html

-- 
Magnus Lie Hetland              "Wake up!"  - Rage Against The Machine
http://hetland.org              "Shut up!"  - Linkin Park




reply via email to

[Prev in Thread] Current Thread [Next in Thread]