Re: Submission for manual

bison-patches

From:	Paul Eggert
Subject:	Re: Submission for manual
Date:	Mon, 21 Jun 2004 13:57:45 -0700
User-agent:	Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)
Frank Heckenbach <address@hidden> writes:

> I've applied your other comments.

Thanks for that nice contribution to the manual.  I made some minor
editorial changes (e.g., simplifying the example grammar a bit, using
two spaces after sentence-ends) and merged it into the Bison manual.
Here's the patch that I installed.  Please let me know if you see any
problems with it.

2004-06-21  Frank Heckenbach  <address@hidden>

        * doc/bison.texinfo (Simple GLR Parsers): New section.

Index: doc/bison.texinfo
===================================================================
RCS file: /cvsroot/bison/bison/doc/bison.texinfo,v
retrieving revision 1.125
diff -p -u -r1.125 bison.texinfo
--- doc/bison.texinfo   21 Jun 2004 20:20:30 -0000      1.125
+++ doc/bison.texinfo   21 Jun 2004 20:51:59 -0000
@@ -135,7 +135,8 @@ The Concepts of Bison
                         a semantic value (the value of an integer,
                         the name of an identifier, etc.).
 * Semantic Actions::  Each rule can have an action containing C code.
-* GLR Parsers::       Writing parsers for general context-free languages
+* GLR Parsers::       Writing parsers for general context-free languages.
+* Simple GLR Parsers::    Using GLR in its simplest form.
 * Locations Overview::    Tracking Locations.
 * Bison Parser::      What are Bison's input and output,
                         how is the output used?
@@ -381,7 +382,8 @@ use Bison or Yacc, we suggest you start 
                         a semantic value (the value of an integer,
                         the name of an identifier, etc.).
 * Semantic Actions::  Each rule can have an action containing C code.
-* GLR Parsers::       Writing parsers for general context-free languages
+* GLR Parsers::       Writing parsers for general context-free languages.
+* Simple GLR Parsers::    Using GLR in its simplest form.
 * Locations Overview::    Tracking Locations.
 * Bison Parser::      What are Bison's input and output,
                         how is the output used?
@@ -860,6 +862,208 @@ will suffice.  Otherwise, we suggest
 address@hidden
 @end example
 
+@node Simple GLR Parsers
+@section Using @acronym{GLR} in its Simplest Form
+@cindex @acronym{GLR} parsing, unambiguous grammars
+@cindex generalized @acronym{LR} (@acronym{GLR}) parsing, unambiguous grammars
+@findex %glr-parser
+@findex %expect-rr
+@cindex conflicts
+@cindex reduce/reduce conflicts
+
+The C++ example for @acronym{GLR} (@pxref{GLR Parsers}) explains how to use
+the @acronym{GLR} parsing algorithm with some advanced features such as
+@samp{%dprec} and @samp{%merge} to handle syntactically ambiguous
+grammars.  However, the @acronym{GLR} algorithm can also be used in a simpler
+way to parse grammars that are unambiguous, but fail to be @acronym{LALR}(1).
+Such grammars typically require more than one symbol of lookahead,
+or (in rare cases) fall into the category of grammars in which the
+@acronym{LALR}(1) algorithm throws away too much information (they are in
+@acronym{LR}(1), but not @acronym{LALR}(1), @ref{Mystery Conflicts}).
+
+Here is an example of this situation, using a problem that
+arises in the declaration of enumerated and subrange types in the
+programming language Pascal.  These declarations look like this:
+
+@example
+type subrange = lo .. hi;
+type enum = (a, b, c);
+@end example
+
+@noindent
+The original language standard allows only numeric
+literals and constant identifiers for the subrange bounds (@samp{lo}
+and @samp{hi}), but Extended Pascal (ISO/IEC 10206:1990) and many other
+Pascal implementations allow arbitrary expressions there.  This gives
+rise to the following situation, containing a superfluous pair of
+parentheses:
+
+@example
+type subrange = (a) .. b;
+@end example
+
+@noindent
+Compare this to the following declaration of an enumerated
+type with only one value:
+
+@example
+type enum = (a);
+@end example
+
+@noindent
+(These declarations are contrived, but they are syntactically
+valid, and more-complicated cases can come up in practical programs.)
+
+These two declarations look identical until the @samp{..} token.
+With normal @acronym{LALR}(1) one-token look-ahead it is not
+possible to decide between the two forms when the identifier
+@samp{a} is parsed.  It is, however, desirable
+for a parser to decide this, since in the latter case
+@samp{a} must become a new identifier to represent the enumeration
+value, while in the former case @samp{a} must be evaluated with its
+current meaning, which may be a constant or even a function call.
+
+You could parse @samp{(a)} as an ``unspecified identifier in parentheses'',
+to be resolved later, but this typically requires substantial
+contortions in both semantic actions and large parts of the
+grammar, where the parentheses are nested in the recursive rules for
+expressions.
+
+You might think of using the lexer to distinguish between the two
+forms by returning different tokens for currently defined and
+undefined identifiers.  But if these declarations occur in a local
+scope, and @samp{a} is defined in an outer scope, then both forms
+are possible---either locally redefining @samp{a}, or using the
+value of @samp{a} from the outer scope.  So this approach cannot
+work.
+
+A solution to this problem is to use a @acronym{GLR} parser in its simplest
+form, i.e., without using special features such as @samp{%dprec} and
+@samp{%merge}.  When the @acronym{GLR} parser reaches the critical state, it
+simply splits into two branches and pursues both syntax rules
+simultaneously.  Sooner or later, one of them runs into a parsing
+error.  If there is a @samp{..} token before the next
+@samp{;}, the rule for enumerated types fails since it cannot
+accept @samp{..} anywhere; otherwise, the subrange type rule
+fails since it requires a @samp{..} token.  So one of the branches
+fails silently, and the other one continues normally, performing
+all the intermediate actions that were postponed during the split.
+
+If the input is syntactically incorrect, both branches fail and the parser
+reports a syntax error as usual.
+
+The effect of all this is that the parser seems to ``guess'' the
+correct branch to take, or in other words, it seems to use more
+look-ahead than the underlying @acronym{LALR}(1) algorithm actually allows
+for.  In this example, @acronym{LALR}(2) would suffice, but also some cases
+that are not @acronym{LALR}(@math{k}) for any @math{k} can be handled this way.
+
+Since there can be only two branches and at least one of them
+must fail, you need not worry about merging the branches by
+using dynamic precedence or @samp{%merge}.
+
+Another potential problem of @acronym{GLR} does not arise here, either.  In
+general, a @acronym{GLR} parser can take quadratic or cubic worst-case time,
+and the current Bison parser even takes exponential time and space
+for some grammars.  In practice, this rarely happens, and for many
+grammars it is possible to prove that it cannot happen.  In
+the present example, there is only one conflict between two
+rules, and the type-declaration context where the conflict
+arises cannot be nested.  So the number of
+branches that can exist at any time is limited by the constant 2,
+and the parsing time is still linear.
+
+So here we have a case where we can use the benefits of @acronym{GLR}, almost
+without disadvantages.  There are two things to note, though.
+First, one should carefully analyze the conflicts reported by
+Bison to make sure that @acronym{GLR} splitting is done only where it is
+intended to be.  A @acronym{GLR} parser splitting inadvertently may cause
+problems less obvious than an @acronym{LALR} parser statically choosing the
+wrong alternative in a conflict.
+
+Second, interactions with the lexer (@pxref{Semantic Tokens}) must
+be considered with great care.  Since a split parser consumes tokens
+without performing any actions during the split, the lexer cannot
+obtain information via parser actions.  Some cases of
+lexer interactions can simply be eliminated by using @acronym{GLR}, i.e.,
+shifting the complications from the lexer to the parser.  Remaining
+cases have to be checked for safety.
+
+In our example, it would be safe for the lexer to return tokens
+based on their current meanings in some symbol table, because no new
+symbols are defined in the middle of a type declaration.  Though it
+is possible for a parser to define the enumeration
+constants as they are parsed, before the type declaration is
+completed, it actually makes no difference since they cannot be used
+within the same enumerated type declaration.
+
+Here is a Bison grammar corresponding to the example above.  It
+parses a vastly simplified form of Pascal type declarations.
+
+@example
+%token TYPE DOTDOT ID
+
+@group
+%left '+' '-'
+%left '*' '/'
+@end group
+
+%%
+
+@group
+type_decl:
+          TYPE ID '=' type ';'
+;
+@end group
+
+@group
+type:     '(' id_list ')'
+        | expr DOTDOT expr
+;
+@end group
+
+@group
+id_list:  ID
+        | id_list ',' ID
+;
+@end group
+
+@group
+expr:     '(' expr ')'
+        | expr '+' expr
+        | expr '-' expr
+        | expr '*' expr
+        | expr '/' expr
+        | ID
+;
+@end group
+@end example
+
+When used as a normal @acronym{LALR}(1) grammar, Bison correctly complains
+about one reduce/reduce conflict.  In the conflicting situation the
+parser chooses one of the alternatives, arbitrarily the one
+declared first.  Therefore the following correct input is not
+recognized:
+
+@example
+type t = (a) .. b;
+@end example
+
+The parser can be turned into a @acronym{GLR} parser, while also telling Bison
+to be silent about the one known reduce/reduce conflict, simply by
+adding these two declarations to the Bison input file:
+
+@example
+%glr-parser
+%expect-rr 1
+@end example
+
+@noindent
+No change in the grammar itself is required.  Now the
+parser recognizes all valid declarations, according to the
+limited syntax above, transparently.  In fact, the user does not even
+notice when the parser splits.
+
 @node Locations Overview
 @section Locations
 @cindex location
@@ -1290,7 +1494,7 @@ not require it.  You can add or change w
 For example, this:
 
 @example
-exp   : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{}
+exp   : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{} ;
 @end example
 
 @noindent
@@ -1300,6 +1504,7 @@ means the same thing as this:
 exp:      NUM
         | exp exp '+'    @{ $$ = $1 + $2; @}
         | @dots{}
+;
 @end example
 
 @noindent
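
To try the grammar from the new section end to end, the following is a
minimal sketch of a complete Bison input file.  The grammar rules,
`%glr-parser`, and `%expect-rr 1` come from the patch above; the `input`
start rule, the `puts` action, and the hand-written `yylex`, `yyerror`,
and `main` are assumptions added only so that the file can be built and
run, and are not part of the manual text.

```yacc
%glr-parser
%expect-rr 1

%{
#include <ctype.h>
#include <stdio.h>
#include <string.h>

int yylex (void);
void yyerror (char const *msg);
%}

%token TYPE DOTDOT ID

%left '+' '-'
%left '*' '/'

%%

/* Assumed wrapper rule: accept a sequence of declarations.  */
input:    /* empty */
        | input type_decl
;

type_decl:
          TYPE ID '=' type ';'    { puts ("declaration accepted"); }
;

type:     '(' id_list ')'
        | expr DOTDOT expr
;

id_list:  ID
        | id_list ',' ID
;

expr:     '(' expr ')'
        | expr '+' expr
        | expr '-' expr
        | expr '*' expr
        | expr '/' expr
        | ID
;

%%

/* Assumed toy lexer: the keyword "type", identifiers, "..", and
   single-character tokens, read from standard input.  */
int
yylex (void)
{
  int c = getchar ();
  while (c == ' ' || c == '\t' || c == '\n')
    c = getchar ();
  if (c == EOF)
    return 0;
  if (isalpha (c))
    {
      char buf[64];
      size_t n = 0;
      do
        {
          if (n < sizeof buf - 1)
            buf[n++] = (char) c;
          c = getchar ();
        }
      while (isalnum (c));
      ungetc (c, stdin);
      buf[n] = '\0';
      return strcmp (buf, "type") == 0 ? TYPE : ID;
    }
  if (c == '.')
    {
      int d = getchar ();
      if (d == '.')
        return DOTDOT;
      ungetc (d, stdin);
    }
  return c;
}

void
yyerror (char const *msg)
{
  fprintf (stderr, "%s\n", msg);
}

int
main (void)
{
  return yyparse ();
}
```

Assuming the file is saved as `pascal-types.y` (a hypothetical name),
running `bison pascal-types.y` and compiling the generated
`pascal-types.tab.c` with a C compiler should yield a program that reads
declarations from standard input.  Both `type t = (a) .. b;` and
`type e = (a);` are expected to be accepted, since the branch that took
the wrong reduction for `a` dies at the next token it cannot handle.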