Re: Getting involved in Bison

From: Akim Demaille
Subject: Re: Getting involved in Bison
Date: Tue, 15 Oct 2019 08:42:03 +0200

Hi Victor!

> On 15 Oct 2019, at 06:19, Paul Eggert <address@hidden> wrote:
> On 10/14/19 7:12 PM, Morales Cayuela, Victor (NSB - CN/Hangzhou) wrote:
>> Could you let me know in which areas you would need help?
> Thanks for volunteering. Akim is the best person to ask.

Thanks :)

> Also, I suggest looking at Bison's TODO file for some ideas.
> https://git.savannah.gnu.org/cgit/bison.git/tree/TODO

Which was the impetus I needed to update it, see below.

For a small project, Bison is quite big, and requires quite different skills 
depending on which area you, Victor, would like to work on.  I strongly 
recommend starting with simple things (which does not mean dumb ones).

On the backend side (aka skeleton), in C++, how about implementing push 
parsers?  That would be very useful in several projects I know.  It is 
moderately difficult to implement "by hand", but you'll certainly find that m4 
is a weird beast.  One path would be to generate a usual pull parser for, say, 
arithmetic, and rework it by hand to become a push parser, and later see how to 
move these changes into lalr1.cc.
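
To make the pull/push distinction concrete, here is a minimal hand-written sketch of a push parser for arithmetic.  It is not what lalr1.cc generates: it uses a tiny operator-precedence scheme instead of LALR tables, and every name in it (Kind, Token, PushParser, feedAll) is hypothetical.  The point it illustrates is only the control flow: the parser keeps its stacks between calls, and the caller feeds it one token at a time.

```cpp
#include <cassert>
#include <vector>

// Hypothetical token kinds for a tiny arithmetic language.
enum class Kind { Num, Plus, Times, End };

struct Token {
  Kind kind;
  int value;  // Only meaningful for Kind::Num.
};

// Hypothetical hand-written push parser.  All parsing state lives in the
// object, so push() can return to the caller after every token.
class PushParser {
public:
  enum class Status { NeedMore, Accepted, Error };

  // Consume exactly one token and report whether more input is needed.
  Status push(Token tok) {
    if (done_) return Status::Error;
    switch (tok.kind) {
    case Kind::Num:
      if (!expectOperand_) return fail();
      values_.push_back(tok.value);
      expectOperand_ = false;
      return Status::NeedMore;
    case Kind::Plus:
    case Kind::Times:
      if (expectOperand_) return fail();
      // Reduce pending operators of higher or equal precedence first.
      while (!ops_.empty() && prec(ops_.back()) >= prec(tok.kind))
        reduce();
      ops_.push_back(tok.kind);
      expectOperand_ = true;
      return Status::NeedMore;
    case Kind::End:
      if (expectOperand_) return fail();
      while (!ops_.empty()) reduce();
      done_ = accepted_ = true;
      return Status::Accepted;
    }
    return fail();
  }

  // Value of the expression; only valid once push() returned Accepted.
  int result() const {
    assert(accepted_);
    return values_.back();
  }

private:
  static int prec(Kind k) { return k == Kind::Times ? 2 : 1; }

  // Apply the topmost operator to the two topmost values.
  void reduce() {
    Kind op = ops_.back(); ops_.pop_back();
    int rhs = values_.back(); values_.pop_back();
    int lhs = values_.back(); values_.pop_back();
    values_.push_back(op == Kind::Plus ? lhs + rhs : lhs * rhs);
  }

  Status fail() { done_ = true; return Status::Error; }

  std::vector<int> values_;
  std::vector<Kind> ops_;
  bool expectOperand_ = true;
  bool done_ = false;
  bool accepted_ = false;
};

// Example driver: the caller, not the parser, owns the token loop.
PushParser::Status feedAll(PushParser& p, const std::vector<Token>& toks) {
  PushParser::Status st = PushParser::Status::NeedMore;
  for (const Token& t : toks) {
    st = p.push(t);
    if (st != PushParser::Status::NeedMore) break;
  }
  return st;
}
```

In a pull parser the loop inside feedAll would live in parse() and call the scanner itself; here the tokens can just as well arrive one by one from an event loop or a network socket.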

In bison itself (the generator), for a simple start, I would recommend cleaning 
up the graph generation.  Today it's sort of OOP with an abstract interface for 
graph, and a concrete implementation for Dot.  This is because decades ago we 
supported a format called VCG, which has disappeared since then.  I think we 
should flatten this to a direct interface for Dot, removing all the useless 

There are many more possible things, but it really depends on what you'd like to 
work on, and how fluent you are in C (for bison the generator) and m4 (the 

diff --git a/TODO b/TODO
index f3f08ce1..d2c56b73 100644
--- a/TODO
+++ b/TODO
@@ -7,9 +7,6 @@ breaks.
 Also, we seem to teach YYPRINT very early on, although it should be
 considered deprecated: %printer is superior.
-** glr.cc
-move glr.c into the yy namespace
 ** improve syntax errors (UTF-8, internationalization)
 Bison depends on the current locale.  For instance:
@@ -58,7 +55,7 @@ Maybe we should exhibit the YYUNDEFTOK token.  It could also be assigned a
 semantic value so that yyerror could be used to report invalid lexemes.
 * Bison 3.6
-** Unit rules
+** Unit rules / Injection rules (Akim Demaille)
 Maybe we could expand unit rules (or "injections", see
 https://homepages.cwi.nl/~daybuild/daily-books/syntax/2-sdf/sdf.html), i.e.,
@@ -77,10 +74,12 @@ Practice' is impossible to find, but according to 'Parsing Techniques: a
 Practical Guide', it includes information about this issue.  Does anybody
 have it?
-** Injection rules
-See above.
+** clean up (Akim Demaille)
+Do not work on these items now, as I (Akim) have branches with a lot of
+changes in this area (hitting several files), and no desire to have to fix
+conflicts.  Addressing these items will happen after my branches have been
-** clean up
 *** lalr.c
 Introduce a goto struct, and use it in place of from_state/to_state.
 Rename states1 as path, length as pathlen.
@@ -130,6 +129,84 @@ $ ./tests/testsuite -l | grep errors | sed q
   38: input.at:1730      errors
 * Short term
+** Stop indentation in diagnostics
+Before Bison 2.7, we printed "flatly" the dependencies in long diagnostics:
+    input.y:2.7-12: %type redeclaration for exp
+    input.y:1.7-12: previous declaration
+In Bison 2.7, we indented them
+    input.y:2.7-12: error: %type redeclaration for exp
+    input.y:1.7-12:     previous declaration
+Later we quoted the source in the diagnostics, and today we have:
+    /tmp/foo.y:1.12-14: warning: symbol FOO redeclared [-Wother]
+        1 | %token FOO FOO
+          |            ^~~
+    /tmp/foo.y:1.8-10:      previous declaration
+        1 | %token FOO FOO
+          |        ^~~
+The indentation is no longer helping.  We should probably get rid of it, or
+maybe keep it only when -fno-caret. GCC displays this as a "note":
+    $ g++-mp-9 -Wall /tmp/foo.c -c
+    /tmp/foo.c:1:10: error: redefinition of 'int foo'
+        1 | int foo, foo;
+          |          ^~~
+    /tmp/foo.c:1:5: note: 'int foo' previously declared here
+        1 | int foo, foo;
+          |     ^~~
+Likewise for Clang, contrary to what I believed (because "note:" is written
+in black, so it doesn't show in my terminal :-)
+    $ clang++-mp-8.0 -Wall /tmp/foo.c -c
+    clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
+    /tmp/foo.c:1:10: error: redefinition of 'foo'
+    int foo, foo;
+             ^
+    /tmp/foo.c:1:5: note: previous definition is here
+    int foo, foo;
+        ^
+    1 error generated.
+** Better design for diagnostics
+The current implementation of diagnostics is adhoc, it grew organically.  It
+works as a series of calls to several functions, with dependency of the
+latter calls on the former.  For instance:
+      complain (&sym->location,
+                sym->content->status == needed ? complaint : Wother,
+                _("symbol %s is used, but is not defined as a token"
+                  " and has no rules; did you mean %s?"),
+                quote_n (0, sym->tag),
+                quote_n (1, best->tag));
+      if (feature_flag & feature_caret)
+        location_caret_suggestion (sym->location, best->tag, stderr);
+We should rewrite this in a more FP way:
+1. build a rich structure that denotes the (complete) diagnostic.
+   "Complete" in the sense that it also contains the suggestions, the list
+   of possible matches, etc.
+2. send this to the pretty-printing routine.  The diagnostic structure
+   should be sufficient so that we can generate all the 'format' of
+   diagnostics, including the fixits.
+If properly done, this diagnostic module can be detached from Bison and be
+put in gnulib.  It could be used, for instance, for errors caught by
+There's certainly already something alike in GCC.  At least that's the
+impression I get from reading the "-fdiagnostics-format=FORMAT" part of this
 ** consistency
 token vs terminal
@@ -139,11 +216,10 @@ itself uses int (for yylen for instance), yet stack is based on size_t.
 Maybe locations should also move to ints.
-** C
-Introduce state_type rather than spreading yytype_int16 everywhere?
-** glr.c
-yyspaceLeft should probably be a pointer diff.
+Paul Eggert already covered most of this.  But before publishing these
+changes, we need to ask our C++ users if they agree with that change, or if
+we need some migration path.  Could be a %define variable, or simply
+%require "3.5".
 ** Graphviz display code thoughts
 The code for the --graph option is over two files: print_graph, and
@@ -164,9 +240,6 @@ Little effort seems to have been given to factoring these files and their
 rint{,-xml} counterpart. We would very much like to re-use the pretty format
 of states from .output for the graphs, etc.
-Also, the underscore in print_graph.[ch] isn't very fitting considering the
-dashes in the other filenames.
 Since graphviz dies on medium-to-big grammars, maybe consider an other tool?
 ** push-parser
@@ -224,11 +297,13 @@ since it is no longer bound to a particular parser, it's just a
 (standalone symbol).
 * Various
-** Rewrite glr.cc in C++
+** Rewrite glr.cc in C++ (Valentin Tolmer)
 As a matter of fact, it would be very interesting to see how much we can
 share between lalr1.cc and glr.cc.  Most of the skeletons should be common.
 It would be a very nice source of inspiration for the other languages.
+Valentin Tolmer is working on this.
 Defined to 256, but not used, not documented.  Probably the token
 number for the error token, which POSIX wants to be 256, but which
@@ -298,10 +373,21 @@ other improvements and also made it faster (probably because memory
 management is performed once instead of three times).  I suggest that
 we do the same in yacc.c.
+(Some time later): it's also very nice to have three stacks: it's more dense
+as we don't lose bits to padding.  For instance the typical stack for states
+will use 8 bits, while it is likely to consume 32 bits in a struct.
+We need trustworthy benchmarks for Bison, for all our backends.  Akim has a
+few things scattered around; we need to put them in the repo, and make them
+more useful.
 ** yysyntax_error
 The code bw glr.c and yacc.c is really alike, we can certainly factor
 some parts.
+This should be worked on when we also address the expected improvements for
+error generation (e.g., i18n).
 * Report
@@ -341,7 +427,26 @@ LORIA, INRIA Nancy - Grand Est, Nancy, France
 * Extensions
 ** Multiple start symbols
-Would be very useful when parsing closely related languages.
+Would be very useful when parsing closely related languages.  The idea is to
+declare several start symbols, for instance
+    %start stmt expr
+    %%
+    stmt: ...
+    expr: ...
+and to generate parse(), parse_stmt() and parse_expr().  Technically, the
+above grammar would be transformed into
+   %start yy_start
+   %%
+   yy_start: YY_START_STMT stmt | YY_START_EXPR expr
+so that there are no new conflicts in the grammar (as would undoubtedly
+happen with yy_start: stmt | expr).  Then adjust the skeletons so that this
+initial token (YY_START_STMT, YY_START_EXPR) be shifted first in the
+corresponding parse function.
 ** Better error messages
 The users are not provided with enough tools to forge their error messages.
@@ -359,6 +464,12 @@ should make this reasonably easy to implement.
 Bruce Mardle <address@hidden>
+However, there are many other things to do before having such a feature,
+because I don't want a % equivalent to #include (which we all learned to
+hate).  I want something that builds "modules" of grammars, and assembles
+them together, paying attention to keep separate bits separated, in pseudo
+name spaces.
 ** Push parsers
 There is demand for push parsers in Java and C++.  And GLR I guess.
@@ -385,6 +496,10 @@ must be in the scanner: we must not parse what is in a switched off
 part of %if.  Akim Demaille thinks it should be in the parser, so as
 to avoid falling into another CPP mistake.
+(Later): I'm sure there's actually good case for this.  People who need that
+feature can use m4/cpp on top of Bison.  I don't think it is worth the
+trouble in Bison itself.
 ** XML Output
 There are couple of available extensions of Bison targeting some XML
 output.  Some day we should consider including them.  One issue is
@@ -404,6 +519,9 @@ XML output for GNU Bison
+Andrew Myers and Vincent Imbimbo are working on this item, see
 * Coding system independence
 Paul notes:
@@ -433,6 +551,7 @@ It is unfortunate that there is a total order for precedence.  It
 makes it impossible to have modular precedence information.  We should
 move to partial orders (sounds like series/parallel orders to me).
+This is a prerequisite for modules.
 * $undefined
 From Hans:

