Re: Parsing a language with optional spaces


From: Christian Schoenebeck
Subject: Re: Parsing a language with optional spaces
Date: Tue, 14 Jul 2020 13:31:04 +0200

On Monday, 13 July 2020 07:56:57 CEST Akim Demaille wrote:
> > On 12 July 2020, at 19:47, Christian Schoenebeck
> > <schoenebeck@crudebyte.com> wrote:
> >> Additionally, the programmer must fully understand how
> >> and when the parser handles tokens, otherwise subtle bugs may be
> >> introduced.
> > 
> > I don't see exactly what they had in mind here to make that claim.
> 
> Bison is "lazy" wrt fetching tokens, in the sense that if it can decide
> what the next action is without looking at the lookahead, it won't fetch
> one.  Besides, when it can, it merges "groups of similar reductions" under
> the label of "default reduction".
> 
> So in situations that are similar, the parser may have fetched the lookahead
> before the reduction(s), or after.  Hence if you want tight coupling between
> the scanner and the parser, you need to be aware of this and know when this
> will happen.
> 
> That's what lr.default-reduction controls.
> (https://www.gnu.org/software/bison/manual/html_node/Default-Reductions.html
> )

Ok, that's specific to LALR.
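
For reference, this is roughly where that knob goes in a grammar file. A
minimal sketch, assuming a recent Bison 3.x (the yylex/yyerror stubs are just
placeholders):

    /* default-reduction.y -- minimal sketch, only to show the directive */
    %{
    #include <stdio.h>
    int yylex (void) { return 0; }   /* stub scanner: returns end-of-input at once */
    void yyerror (const char *msg) { fprintf (stderr, "%s\n", msg); }
    %}

    /* Accepted values (per the manual page above): "most" (the default unless
       lr.type is canonical-lr), "consistent", and "accepting".  The fewer
       default reductions, the more predictably the parser fetches its
       lookahead before reducing, at the cost of larger tables. */
    %define lr.default-reduction consistent

    %%
    input: %empty | input 'a';
    %%

    int main (void) { return yyparse (); }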

For the 'married' parser, in the form I imagine it, there would be no
lookahead token, as a lookahead token implies a context-unaware scanner.

Instead, when the parser reached a state where it had to decide between
reducing and shifting (depending on the next token), it would not decide; it
would fork the parser state instead, in order to allow context-specific
tokenization.
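
For comparison: Bison's existing %glr-parser already does the "fork on
conflict" half of this, though still with a conventional lookahead token and a
context-unaware scanner. A minimal sketch of that existing mechanism (the
trivial scanner and the toy grammar are of course made up for illustration):

    /* fork-on-conflict.y -- minimal sketch of today's %glr-parser splitting.
       The token 'x' can be reduced either as a "type" or as an "expr"; a
       deterministic parser would have to commit immediately, the GLR parser
       forks its stack instead and lets both interpretations run. */
    %{
    #include <stdio.h>
    static const char *input = "x;";
    int yylex (void) { return *input ? *input++ : 0; }   /* trivial scanner over "x;" */
    void yyerror (const char *msg) { fprintf (stderr, "%s\n", msg); }
    %}

    %glr-parser
    %expect-rr 1            /* the reduce/reduce conflict below is intentional */

    %%
    stmt: type ';'  %dprec 2    /* if both forks survive, prefer the "type" reading */
        | expr ';'  %dprec 1
        ;
    type: 'x';
    expr: 'x';
    %%

    int main (void) { return yyparse (); }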

> > Note that the latter example would of course [require to] work differently
> > than the common greedy and context unaware scanner:
>
> Yes, indeed.  That's something which Joel detailed in his PhD.  Note
> though that PSLR targets deterministic parsing.  Bison would (first)
> target that, not GLR.  But contributions are always welcome :)

The longer I think about it, the more I wonder whether it wouldn't make sense
to make a clear and radical cut with the past when developing a 'married'
parser, to allow it to unfold its full potential.

I mean, I look at what you are working on, Akim, and AFAICS the majority of
the time you have to spend on maintaining (i.e. not breaking) backward
compatibility for ancient use cases. Most of Bison's conceptual design and
internals are still based on requirements and opinions from around 1980 which
simply no longer exist or have proved wrong.

Besides what we already discussed, I think it would make sense to get rid of a
bunch of other things that Bison still retains:

1. No longer multiple parser types, only one: a somewhat GLR-like parser as
   discussed, a real superset of LALR, but without a lookahead token (so not
   really LALR-based under the hood). Theoretically there is no disadvantage
   in doing so, because if a developer feeds it a LALR(1)-capable grammar, it
   would still end up with the same parsing complexity (i.e. efficiency) as an
   old-school LALR parser, probably just with a slightly higher memory
   footprint.

   Advantages though: much more user friendly and powerful, and a lot less
   code to maintain.

2. No longer raw C tables as pregenerated parser and scanner tables; instead
   of such C tables: an intermediate, dynamic in-memory model with a
   high-level API that allows accessing and modifying the grammar and the
   scanner patterns on the fly (see the hypothetical API sketch after this
   list).

   Advantage: it would allow self-extensible languages (e.g. an input that
   adds keywords or patterns, or drops or adds grammar rules, while being
   parsed). And implementation-specific things would finally be clearly
   isolated by introducing such an API -> easier to maintain.

   Disadvantages: probably none, except consuming slightly more memory for
   meta information.

3. No longer aggressive compression of parser states (which is somewhat
   already implied by [2.]), because such compression leads to important
   information loss when it comes to grammar debugging, auto-completion
   features and other user-relevant purposes. The motivation that led to such
   compression (very low RAM space) no longer exists in probably >99% of use
   cases today.
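
To make [2.] a bit more concrete, here is a purely hypothetical sketch of what
such a high-level API could look like. None of these names exist in Bison
today; they are invented only to illustrate the idea:

    /* grammar-model.h -- hypothetical sketch only; no such API exists in Bison. */
    #include <stdbool.h>

    typedef struct bx_grammar bx_grammar;   /* opaque in-memory grammar/pattern model */
    typedef struct bx_parser  bx_parser;    /* parser driven by that model            */

    /* Build the model once, e.g. from a .y-like description. */
    bx_grammar *bx_grammar_load (const char *path);

    /* Modify the model on the fly, even from within semantic actions. */
    bool bx_grammar_add_token   (bx_grammar *g, const char *name, const char *pattern);
    bool bx_grammar_add_rule    (bx_grammar *g, const char *lhs,  const char *rhs);
    bool bx_grammar_remove_rule (bx_grammar *g, const char *lhs,  const char *rhs);

    /* The parser (re)derives whatever internal tables it needs lazily from
       the model, so a self-extensible language could add a keyword while its
       own input is still being parsed. */
    bx_parser *bx_parser_new (bx_grammar *g);
    int        bx_parse      (bx_parser *p, const char *text);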

So these concepts are probably too radical to still fit into Bison's current
(3.x) design.

> > Akim, question regarding Bison's GLR parser ...
> > 
> > https://www.gnu.org/software/bison/manual/html_node/Simple-GLR-Parsers.html:
> >> In general, a GLR parser can take quadratic or cubic worst-case time, and
> >> the current Bison parser even takes exponential time and space for some
> >> grammars. In practice, this rarely happens, and for many grammars it is
> >> possible to prove that it cannot happen.
> > 
> > ... why isn't it O(n^3)? Or more specifically, which common GLR fork
> > optimization(s) are still missing?
> 
> This is fuzzy in my memory :(  It might refer to what follows.
> 
> Parsers such as SGLR have stacks which are graphs, while Bison's is
> always a tree.  All GLR parsers manage to share the past of concurrent
> stacks, they don't have a set of separate stacks.  Bison's GLR stacks,
> of course, share their common past, it's stored only once.  I believe
> SGLR is also able to share the common "future", so their stacks are
> no longer trees.  This allows SGLR to keep decent performance when parsing
> ambiguous grammars where you want to fetch the resulting parse forest.

Ok, I read that as Bison's current GLR parser probably not having received
much attention in the past.

BTW, from my perspective a set of parser stacks (with a shared past) is still
a tree, just a tree with 3 dimensions instead of 2. A 'graph' for me is
something much more generic, as a graph allows much more complex structures
(with fewer restrictions), like traversing to any other state, which is not
possible in a tree.
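
Just to illustrate the distinction I mean in code terms (a rough sketch,
neither Bison's nor SGLR's actual data structures):

    /* Rough sketch only; not the real implementation of either parser. */
    #include <stddef.h>

    /* Tree-shaped stacks (Bison's GLR, roughly): every node has exactly one
       predecessor, so forked stacks share their past but never re-join. */
    struct tree_stack_node {
      int state;
      struct tree_stack_node *pred;     /* exactly one predecessor */
    };

    /* Graph-structured stack (SGLR-style): a node may have several
       predecessors, so stacks can merge again and also share their "future";
       the whole structure is a DAG rather than a tree. */
    struct gss_node {
      int state;
      size_t npreds;
      struct gss_node **preds;          /* zero or more predecessors */
    };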

Best regards,
Christian Schoenebeck




