Re: Initial %include implementation

From: David M. Warme
Subject: Re: Initial %include implementation
Date: Mon, 26 Nov 2018 03:27:38 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.3.0


First thing to note is that our original implementation of this was
more than 20 years ago (1996?).  I'm not sure if ANTLR even existed back
then.  There was very little active Bison development going on,
and Richard Stallman preferred to leave it alone (he didn't understand the
grammar ==> state machine conversion algorithm, and was quite concerned
about breaking things like gcc).

> Also, I guess you kind of forked Bison and don't track its
> evolution, do you?

We have a set of patches that we occasionally port to more recent
versions of Bison -- usually when the old version of Bison stops
building or working correctly after a platform or operating system
upgrade.  The most recent version of Bison that we are working off
is 2.4.3 (circa 2010).  This still works fine on CentOs 7, so we have no
compelling reason to push these patches into any more recent version of

> Actually, I dislike %union, that's why I introduced %define
> api.value.type union.
> Did you consider using this instead of %union?  Or do you see some
> value in %union that api.value.type=union does not offer?

I was not aware of api.value.type.  I do not immediately grasp the
problem (with %union) that the new api.value.type feature is designed
to overcome.  Please explain (or point me at a thread that does so)
if you think it is really important to the discourse.

> I remember that you were among those who needed some of the tables
> that we now generate; were these tables playing a role in this
> modular-grammars support?  Or were they for completely different
> issues?

We use our own parsing skeleton file.  There were efforts to eliminate
as many of the tables as possible, yytname[] in particular.  (There
appeared to be a fair bit of interest in making the generated parsers
be as small as possible.)

In general, anybody who is using their own parsing skeleton would like
a way to *force* Bison to generate a *specified* subset of its general
suite of tables.  I think at one point our patches provided a declaration
like %generate_tables <list of table names> that would force Bison to
generate each specified table whether Bison thought them necessary or not.
This may be largely unnecessary now that Bison uses M4 -- I think Bison
now defines macros for all of the tables (whether used or not) and the
skeleton can decide which of them to expand.

> So you really developed a preprocessor to Bison, right?

Correct.  A set of GNU M4 macros that used the m4_divert()
facility to ensure that each section of each include file wound
up in the correct section of the final generated .y file.

> You didn't even have to hack Bison, or did you?

I think our initial implementation used a "stock" version of
Bison, and it stayed that way for quite a few years.  During a
"business downturn," in 2010 we had some time to polish our tool
chain a bit, and we finally modified Bison to support the
%begin-library and %end-library directives (which modify Bison's
"useless symbol and rule" reporting behavior, but only when the
--warn-by-library option is specified on the command line).  These
%begin-library and %end-library declarations were automatically
added to the generated .y file whenever Include_Grammar() was
used.  We also modified M4 to automatically generate make
dependencies.  The net result for our code base was that:

        - All dependencies for C, C++ and grammars are now
          generated automatically throughout our build process.

        - We could finally build all of our software with *no*
          error or warning messages (from Bison, or any other
          tool -- Bison was the last "hold out" in this regard).

> Yep.  I'm also worried about associativity/precedence directives,
> with unexpected fusions between non-terminals that were meant
> to be different but just happen to have the same name, etc.

Our grammars do not use %prec or the associativity directives.  We
adopted coding standards that *require* unambiguous grammars.

We do not do anything to prevent two include files (or an include
file and the main grammar) from colliding by internally using the
same non-terminal, leading to the "unexpected fusion" that you refer
to.  On the other hand,
this allows us to "extend" the grammar provided by some include
file (i.e., in the main grammar file).  This is not a feature
that we use very often, if at all.  We have not found "unexpected
fusion" to be a problem in practice.  Good naming practices
prevent such mishaps.

We do have one name collision issue that is really more of a "name
divergence" issue concerning tokens that are keywords.  Matches
against the lexer's keyword table are done in a case-insensitive
manner, so all of the grammar's keywords get recognized/accepted
from the input file in a case-insensitive manner.  (The data values
for IDENT tokens, however, preserve the case of the identifier as it
appears in the input file.)  Suppose we have the following rules
(perhaps appearing in separate include files):

        a : b "AND" c { ... };

        p : q "and" r { ... };

Bison treats "AND" and "and" as two distinct terminals having two
distinct token numbers.  The lexer recognizes all identifiers
matching [Aa][Nn][Dd] as being this keyword -- but there is no way
to decide which of the two token numbers the lexer must return.
Grammars having this structure produce a core dump from
init_special_tokens() during yyparse() initialization.  All keywords
appearing in grammar files must be spelled consistently (with respect
to use of upper/lower case letters).
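The case-insensitive lookup, and why it cannot tolerate two token
numbers for one spelling, can be sketched as follows (a minimal
illustration with invented names, not our actual lexer code):

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Case-insensitive keyword table: keywords are stored lower-cased, so
// "AND" and "and" both fold to the single key "and" -- the table
// cannot hold two distinct token numbers for them, which is why every
// keyword must be spelled consistently across the grammar files.
struct KeywordTable {
    std::map<std::string, int> table;   // folded spelling -> token number

    static std::string fold(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        return s;
    }

    // Returns false on a "name divergence" collision: a differently
    // numbered entry already folds to the same spelling.
    bool add(const std::string& spelling, int token) {
        auto [it, inserted] = table.emplace(fold(spelling), token);
        return inserted || it->second == token;
    }

    // Lookup during lexing: -1 means "not a keyword, treat as IDENT".
    int lookup(const std::string& ident) const {
        auto it = table.find(fold(ident));
        return it == table.end() ? -1 : it->second;
    }
};
```

In our real implementation the collision is detected during
yyparse() initialization, which is where the core dump mentioned
above occurs.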

> Do you have any form of scoping in your modules?

No.  We do, however, implement a crude form of "templated" grammars.
These are obtained as follows:

Include_Grammar2(<grammar-file>, foo_)
Include_Grammar2(<grammar-file>, bar_)
Include_Grammar2(<grammar-file>, baz_)

Within the included grammar header, all occurrences of ARG2 get
replaced with foo_, bar_, or baz_, as appropriate.  We code all of
these occurrences as:

        {[]}ARG2{[]}

for safety when concatenating things together.  (Yes, ugly.)  This allows
us to write a grammatical phrase having one or more non-terminals that
are not defined in the include file, but are "place holders" for
syntax defined by the top-level grammar.  For example, perhaps the
included grammar defines some sort of "container" syntax, and the
grammar for the "entries" in the container is specified by this
"template parameter".
The top-level grammar need only specify:

        foo_datum :     fnum            { ... };
        bar_datum :     distribution    { ... };
        baz_datum :     yes_or_no       { ... };

Once again, this is implemented entirely by M4 macros, and it is the
developer's responsibility to assure that all of the non-terminals
defined in the included grammar have {[]}ARG2{[]} in them somewhere
(usually at the beginning of the symbol name) so that the symbols from
different inclusions don't collide (unless, of course, you include it
with duplicate ARG2 values).

>  1. Write a separate, custom lexical analyzer (floor to ceiling)
>     for each top-level grammar that you process.  (Yikes!)

> The Yikes is about the source code and the effort demanded of the
> developers, and not about the code bloat in the object code, right?

Yikes is mostly about developer effort, but bloat certainly applies
if you take this route.  With > 150 grammars in our code base,
having > 150 distinct lexical analyzers becomes *obscene*.

> You appear to really depend a lot on the syntax of yytname.
> *Please*, have a look at this proposal, and comment on it!
> I _need_ feedback.

I certainly saw this thread go by (fairly recently, if I recall).
My opinion on the matter:

1. For "special" Bison-internal symbols, the corresponding string
   appearing in yytname[] should be *invalid* to use as a symbol in
   any Bison input file.  Consistently putting a "$" prefix on such
   symbols accomplishes this, and prevents such Bison-internal
   symbols from colliding with any symbol in the user's grammar.

2. For all user-specified symbols, the corresponding string appearing
   in yytname[] should be identical to what the user specified in
   the Bison input file.  Grammars that use double-quoted strings as
   terminal symbols should definitely include the double-quotes as part
   of the yytname[] string data.

Beyond that, I don't really care too much whether these characters
are ASCII, Arabic, or Klingon, nor do I care much about how these
characters are encoded in the resulting string constants -- as long
as traditional ASCII characters remain the identical sequence of
ASCII characters in yytname[], I am happy.  Things get a bit trickier,
e.g., if you suddenly decide that the type of yytname must become
wchar_t * yytname[]; (unless you provide some means of forcing it
back to char * yytname[]).

>    a. Keywords.  These are first recognized as an IDENT, but then looked up
>    in a hash table of "exceptions".  (Our implementation allows you to make
>    the keyword be "reserved" if the yytname[] entry ends with a !, (e.g.,
>    "\"begin!\"" and "\"end!\"" in the above example).  In this case the
>    keyword's token number is always returned.  If the keyword is NOT
>    reserved, our implementation checks to see if the keyword's token
>    number is "shiftable" in the current parsing context.  If so, it
>    returns the keyword's token number, otherwise it returns IDENT's
>    token number.)

> That's interesting!  So they are not really reserved words; they
> behave like, say, 'override' in C++.


We have over 150 different Bison grammars linked into a single executable
file.  It would be a shame if some grammar header file used a keyword
like "flag" somewhere, that prevented the user from using "flag" as a
variable (IDENT token) in some other portion of the input file that was
implementing, e.g., some sort of general programming language.

This is why we decided that (by default) identifier-like keyword tokens are
"keywords" only -- not "reserved" words.  Such a keyword can be specified
to be reserved by suffixing it with "!", as in "begin!" and "end!"  (We
specifically chose to reserve begin and end because we have lots of
sections in our grammars that look like

        Begin Foo_Bar_Section
        End Foo_Bar_Section

Reserving the "begin!" and "end!" keywords provides definite advantages
with respect to the behavior of explicit error recovery rules.)

> A trie???  How many punctuators do you have for a trie to be relevant?
> Or maybe it's for an easy implementation of the longest-match?

You are a smart guy!  Yes, it is both fast, and it gives you easy
longest-match.  We don't go out of our way to litter our grammars with
strange punctuators, so this trie doesn't usually have too many nodes, and
it only exists while the yyparse() function is executing.
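The idea can be sketched in a few lines (a minimal illustration of a
punctuator trie with longest-match, not our actual Lexer code):

```cpp
#include <map>
#include <memory>
#include <string>

// Minimal punctuator trie.  Each node may mark the end of a
// punctuator (token != -1); matching remembers the deepest such node
// seen, which yields the longest match for free.
struct TrieNode {
    int token = -1;
    std::map<char, std::unique_ptr<TrieNode>> next;
};

struct PunctTrie {
    TrieNode root;

    void insert(const std::string& s, int tok) {
        TrieNode* n = &root;
        for (char c : s) {
            auto& child = n->next[c];
            if (!child) child = std::make_unique<TrieNode>();
            n = child.get();
        }
        n->token = tok;
    }

    // Scan `in` starting at `pos`; return the token number of the
    // longest punctuator matched (or -1), advancing `pos` past it.
    int match(const std::string& in, std::size_t& pos) const {
        const TrieNode* n = &root;
        int best = -1;
        std::size_t best_end = pos;
        for (std::size_t i = pos; i < in.size(); ++i) {
            auto it = n->next.find(in[i]);
            if (it == n->next.end()) break;
            n = it->second.get();
            if (n->token != -1) { best = n->token; best_end = i + 1; }
        }
        pos = best_end;
        return best;
    }
};
```

With "<", "<=", and "<<" inserted, scanning "<=x" walks two nodes
deep and returns the "<=" token rather than stopping at "<".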

> Thanks a lot for taking the time to craft a precise description of
> your work!

You are most welcome.  I monitor the Bison lists (*lots* of traffic
at the moment! :^).  The proposed patch for an "include" feature seemed
like a good prompt for sharing what I know about what is really needed
in practice.

> I have one last question: why Bison?  For instance, ANTLR would
> probably also have made sense, and it features things that Bison
> does not.

The decision to use Bison was made circa 1996.  I don't think I knew
about ANTLR back then, but had been using Bison since 1985, and yacc
before then.  So basically, Bison was by far the most familiar tool
for me to harness back in 1996.

Other things you should know:

- Our skeleton file generates C++ code.

- Each grammar defines a new C++ class, derived from a Parser base class.

- Each such class:
        - Has its own yyparse() member function.
        - Owns a Lexer object used by the parser.  (This Lexer object is
          customized from yytname[] during yyparse() initialization.)

- This makes it a simple matter to integrate > 150 distinct grammars into
  one executable file.  (No rename cruft as needed in C.)

- The Parser base class provides a pure virtual accessor function
  (implemented by each derived class / grammar) that returns a
  "const ParserTables *", defined as follows:

        struct ParserTables {
                int                     d_yyntokens;
                int                     d_yynnts;
                int                     d_yynrules;
                int                     d_yynstates;
                int                     d_yylast;
                int                     d_yyfinal;
                int                     d_yypact_ninf;
                int                     d_yytable_ninf;
                int                     d_yyntbase;
                int                     d_yynsyms;
                const short *           d_yyprhs;
                const short *           d_yyrhs;
                const char * const *    d_yytname;
                const short *           d_yyr1;
                const short *           d_yyr2;
                const short *           d_yyrline;
                const short *           d_yydefact;
                const short *           d_yydefgoto;
                const short *           d_yypact;
                const short *           d_yypgoto;
                const short *           d_yytable;
                const short *           d_yycheck;
                const char *            d_parserName;   // name of the parser class
                                                        // that defines this grammar
        };

This allows use of any Parser class (for any associated grammar) with
a GenericParser, which produces a "generic" parse tree, whose interior
nodes represent the application of a particular rule number, and leaf
nodes that represent particular terminal symbols.  Each such terminal
node (including that for the $eof token) also contains all of the white
space and comments preceding said token.  This generic parse tree can be
recursively traversed in a trivial manner to re-create a character-for-
character copy of the input file from which it was parsed.  The meaning
of each rule number, and symbol in the generic parse tree can be
deduced by properly interpreting all of the data and tables provided
by a grammar as encapsulated in this single structure.  We have a
special GrammarHelper class that encapsulates many of the operations
needed to search for selected sub-trees, and to analyze and modify
such generic parse trees.
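The generic parse tree and its trivial character-for-character
"unparse" traversal can be sketched like this (illustrative names,
not the actual GenericParser/GrammarHelper API):

```cpp
#include <memory>
#include <string>
#include <vector>

// Sketch of a "generic" parse tree node.  Interior nodes record the
// rule number applied; leaves keep the token text plus all preceding
// whitespace/comments ("leading"), so concatenating the leaves in
// order reproduces the input exactly.
struct Node {
    int rule = -1;                        // interior: rule number applied
    std::string leading, text;            // leaf: trivia + token text
    std::vector<std::unique_ptr<Node>> kids;
};

// Recursively re-create the source text from which the tree was parsed.
std::string unparse(const Node& n) {
    if (n.kids.empty())
        return n.leading + n.text;
    std::string out;
    for (const auto& k : n.kids)
        out += unparse(*k);
    return out;
}
```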

We use this for several things, but mostly to provide GUI tools that
edit several of the "flat" input files in a manner that preserves
indentation, comments, etc.  GUI tools are better when editing certain
of our "flat" input files that represent geometric data such as
geographical regions, or road networks.  (Staring at long sequences of
latitude/longitude pairs in textual form is not terribly meaningful.
Displaying them on a map and letting users click and drag points
around helps a great deal.)

David Warme
