[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Submission for manual
From: |
Paul Eggert |
Subject: |
Re: Submission for manual |
Date: |
Mon, 21 Jun 2004 13:57:45 -0700 |
User-agent: |
Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux) |
Frank Heckenbach <address@hidden> writes:
> I've applied your other comments.
Thanks for that nice contribution to the manual. I made some minor
editorial changes (e.g., simplifying the example grammar a bit, using
two spaces after sentence-ends) and merged it into the Bison manual.
Here's the patch that I installed. Please let me know if you see any
problems with it.
2004-06-21 Frank Heckenbach <address@hidden>
* doc/bison.texinfo (Simple GLR Parsers): New section.
Index: doc/bison.texinfo
===================================================================
RCS file: /cvsroot/bison/bison/doc/bison.texinfo,v
retrieving revision 1.125
diff -p -u -r1.125 bison.texinfo
--- doc/bison.texinfo 21 Jun 2004 20:20:30 -0000 1.125
+++ doc/bison.texinfo 21 Jun 2004 20:51:59 -0000
@@ -135,7 +135,8 @@ The Concepts of Bison
a semantic value (the value of an integer,
the name of an identifier, etc.).
* Semantic Actions:: Each rule can have an action containing C code.
-* GLR Parsers:: Writing parsers for general context-free languages
+* GLR Parsers:: Writing parsers for general context-free languages.
+* Simple GLR Parsers:: Using GLR in its simplest form.
* Locations Overview:: Tracking Locations.
* Bison Parser:: What are Bison's input and output,
how is the output used?
@@ -381,7 +382,8 @@ use Bison or Yacc, we suggest you start
a semantic value (the value of an integer,
the name of an identifier, etc.).
* Semantic Actions:: Each rule can have an action containing C code.
-* GLR Parsers:: Writing parsers for general context-free languages
+* GLR Parsers:: Writing parsers for general context-free languages.
+* Simple GLR Parsers:: Using GLR in its simplest form.
* Locations Overview:: Tracking Locations.
* Bison Parser:: What are Bison's input and output,
how is the output used?
@@ -860,6 +862,208 @@ will suffice. Otherwise, we suggest
address@hidden
@end example
address@hidden Simple GLR Parsers
address@hidden Using @acronym{GLR} in its Simplest Form
address@hidden @acronym{GLR} parsing, unambiguous grammars
address@hidden generalized @acronym{LR} (@acronym{GLR}) parsing, unambiguous
grammars
address@hidden %glr-parser
address@hidden %expect-rr
address@hidden conflicts
address@hidden reduce/reduce conflicts
+
+The C++ example for @acronym{GLR} (@pxref{GLR Parsers}) explains how to use
+the @acronym{GLR} parsing algorithm with some advanced features such as
address@hidden and @samp{%merge} to handle syntactically ambiguous
+grammars. However, the @acronym{GLR} algorithm can also be used in a simpler
+way to parse grammars that are unambiguous, but fail to be @acronym{LALR}(1).
+Such grammars typically require more than one symbol of lookahead,
+or (in rare cases) fall into the category of grammars in which the
address@hidden(1) algorithm throws away too much information (they are in
address@hidden(1), but not @acronym{LALR}(1), @ref{Mystery Conflicts}).
+
+Here is an example of this situation, using a problem that
+arises in the declaration of enumerated and subrange types in the
+programming language Pascal. These declarations look like this:
+
address@hidden
+type subrange = lo .. hi;
+type enum = (a, b, c);
address@hidden example
+
address@hidden
+The original language standard allows only numeric
+literals and constant identifiers for the subrange bounds (@samp{lo}
+and @samp{hi}), but Extended Pascal (ISO/IEC 10206:1990) and many other
+Pascal implementations allow arbitrary expressions there. This gives
+rise to the following situation, containing a superfluous pair of
+parentheses:
+
address@hidden
+type subrange = (a) .. b;
address@hidden example
+
address@hidden
+Compare this to the following declaration of an enumerated
+type with only one value:
+
address@hidden
+type enum = (a);
address@hidden example
+
address@hidden
+(These declarations are contrived, but they are syntactically
+valid, and more-complicated cases can come up in practical programs.)
+
+These two declarations look identical until the @samp{..} token.
+With normal @acronym{LALR}(1) one-token look-ahead it is not
+possible to decide between the two forms when the identifier
address@hidden is parsed. It is, however, desirable
+for a parser to decide this, since in the latter case
address@hidden must become a new identifier to represent the enumeration
+value, while in the former case @samp{a} must be evaluated with its
+current meaning, which may be a constant or even a function call.
+
+You could parse @samp{(a)} as an ``unspecified identifier in parentheses'',
+to be resolved later, but this typically requires substantial
+contortions in both semantic actions and large parts of the
+grammar, where the parentheses are nested in the recursive rules for
+expressions.
+
+You might think of using the lexer to distinguish between the two
+forms by returning different tokens for currently defined and
+undefined identifiers. But if these declarations occur in a local
+scope, and @samp{a} is defined in an outer scope, then both forms
+are possible---either locally redefining @samp{a}, or using the
+value of @samp{a} from the outer scope. So this approach cannot
+work.
+
+A solution to this problem is to use a @acronym{GLR} parser in its simplest
+form, i.e., without using special features such as @samp{%dprec} and
address@hidden When the @acronym{GLR} parser reaches the critical state, it
+simply splits into two branches and pursues both syntax rules
+simultaneously. Sooner or later, one of them runs into a parsing
+error. If there is a @samp{..} token before the next
address@hidden;}, the rule for enumerated types fails since it cannot
+accept @samp{..} anywhere; otherwise, the subrange type rule
+fails since it requires a @samp{..} token. So one of the branches
+fails silently, and the other one continues normally, performing
+all the intermediate actions that were postponed during the split.
+
+If the input is syntactically incorrect, both branches fail and the parser
+reports a syntax error as usual.
+
+The effect of all this is that the parser seems to ``guess'' the
+correct branch to take, or in other words, it seems to use more
+look-ahead than the underlying @acronym{LALR}(1) algorithm actually allows
+for. In this example, @acronym{LALR}(2) would suffice, but also some cases
+that are not @acronym{LALR}(@math{k}) for any @math{k} can be handled this way.
+
+Since there can be only two branches and at least one of them
+must fail, you need not worry about merging the branches by
+using dynamic precedence or @samp{%merge}.
+
+Another potential problem of @acronym{GLR} does not arise here, either. In
+general, a @acronym{GLR} parser can take quadratic or cubic worst-case time,
+and the current Bison parser even takes exponential time and space
+for some grammars. In practice, this rarely happens, and for many
+grammars it is possible to prove that it cannot happen. In
+in the present example, there is only one conflict between two
+rules, and the type-declaration context where the conflict
+arises cannot be nested. So the number of
+branches that can exist at any time is limited by the constant 2,
+and the parsing time is still linear.
+
+So here we have a case where we can use the benefits of @acronym{GLR}, almost
+without disadvantages. There are two things to note, though.
+First, one should carefully analyze the conflicts reported by
+Bison to make sure that @acronym{GLR} splitting is done only where it is
+intended to be. A @acronym{GLR} parser splitting inadvertently may cause
+problems less obvious than an @acronym{LALR} parser statically choosing the
+wrong alternative in a conflict.
+
+Second, interactions with the lexer (@pxref{Semantic Tokens}) must
+be considered with great care. Since a split parser consumes tokens
+without performing any actions during the split, the lexer cannot
+obtain information via parser actions. Some cases of
+lexer interactions can simply be eliminated by using @acronym{GLR}, i.e.,
+shifting the complications from the lexer to the parser. Remaining
+cases have to be checked for safety.
+
+In our example, it would be safe for the lexer to return tokens
+based on their current meanings in some symbol table, because no new
+symbols are defined in the middle of a type declaration. Though it
+is possible for a parser to define the enumeration
+constants as they are parsed, before the type declaration is
+completed, it actually makes no difference since they cannot be used
+within the same enumerated type declaration.
+
+Here is a Bison grammar corresponding to the example above. It
+parses a vastly simplified form of Pascal type declarations.
+
address@hidden
+%token TYPE DOTDOT ID
+
address@hidden
+%left '+' '-'
+%left '*' '/'
address@hidden group
+
+%%
+
address@hidden
+type_decl:
+ TYPE ID '=' type ';'
+;
address@hidden group
+
address@hidden
+type: '(' id_list ')'
+ | expr DOTDOT expr
+;
address@hidden group
+
address@hidden
+id_list: ID
+ | id_list ',' ID
+;
address@hidden group
+
address@hidden
+expr: '(' expr ')'
+ | expr '+' expr
+ | expr '-' expr
+ | expr '*' expr
+ | expr '/' expr
+ | ID
+;
address@hidden group
address@hidden example
+
+When used as a normal @acronym{LALR}(1) grammar, Bison correctly complains
+about one reduce/reduce conflict. In the conflicting situation the
+parser chooses one of the alternatives, arbitrarily the one
+declared first. Therefore the following correct input is not
+recognized:
+
address@hidden
+type t = (a) .. b;
address@hidden example
+
+The parser can be turned into a @acronym{GLR} parser, while also telling Bison
+to be silent about the one known reduce/reduce conflict, simply by
+adding these two declarations to the Bison input file:
+
address@hidden
+%glr-parser
+%expect-rr 1
address@hidden example
+
address@hidden
+No change in the grammar itself is required. Now the
+parser recognizes all valid declarations, according to the
+limited syntax above, transparently. In fact, the user does not even
+notice when the parser splits.
+
@node Locations Overview
@section Locations
@cindex location
@@ -1290,7 +1494,7 @@ not require it. You can add or change w
For example, this:
@example
-exp : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{}
+exp : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{} ;
@end example
@noindent
@@ -1300,6 +1504,7 @@ means the same thing as this:
exp: NUM
| exp exp '+' @{ $$ = $1 + $2; @}
| @dots{}
+;
@end example
@noindent
- Re: Submission for manual,
Paul Eggert <=