Re: Initial %include implementation

From: David M. Warme
Subject: Re: Initial %include implementation
Date: Sat, 17 Nov 2018 18:52:45 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

The %include feature (as coded in this patch) is not terribly useful.
Here is why:

A proper bison source file is partitioned into various sections, and
these sections appear in a certain order (as quoted from the latest
Bison manual):


        %{
        Prologue
        %}

        Bison declarations

        %%
        Grammar rules
        %%

        Epilogue


The canonical use-case for %include is to be able to share
commonly used grammatical sub-phrases among several top-level grammars.
For example, a non-terminal "distribution" that recognizes any of
several random distributions (constant, uniform, normal, exponential,
gamma, beta, weibull, etc.), each having their own particular syntax.
This %include file (Akim called it a "module") could easily need to
deposit various bits of information into all four of the sections
(prologue, declarations, rules and epilogue) to properly encapsulate
all of the functionality associated with this grammatical sub-phrase.
(Actually, there is a 5th pertinent section -- the "%union items".)

The suggested patch is just a classic hierarchical "character stream"
that would force all %included text to appear immediately at the point
in the particular section in which the %include directive is placed.
Think about how bollocksed up Bison would get if such an include file
contained a %% delimiter!

In reality, you probably want each include file to have structure
similar to the top-level grammar file, with 5 individual sections of
its own.

I actually implemented a facility to do exactly this more than 20 years
ago -- but I implemented it as a set of GNU M4 macros, not internally
to Bison.  These macros process a ".g" top-level grammar file,
possibly "including" several ".gh" grammar header files, and produce
a single ".y" Bison source file.  It provides the following facilities:

        Include_Grammar(<file>)         /* Include "grammar header" file */

        Each included file can separately supply:

                /* Text belonging in Prologue section */
                /* Text belonging inside %union { ... } */
                /* Text belonging in Bison declarations section */
                /* Text belonging in the Grammar rules section */
                /* Text belonging in the Epilogue section */

These macros use M4's "m4_divert()" facility (a separate diversion
for each of the 5 sections defined above), and upon detecting EOF,
it concatenates all 5 of these diversions into a single Bison
source file having the proper delimiters between these sections.
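The mechanics are easy to picture outside of M4.  Here is a minimal C++
sketch of the same idea (all names here are hypothetical; the real facility
is the set of M4 macros just described): one buffer per grammar section
stands in for an m4_divert() diversion, and at end-of-input the buffers are
concatenated with Bison's delimiters.

```cpp
#include <array>
#include <sstream>
#include <string>

// One "diversion" (buffer) per grammar section, mimicking m4_divert().
enum Section { PROLOGUE, UNION_ITEMS, DECLARATIONS, RULES, EPILOGUE, NSECTIONS };

struct GrammarAssembler {
    std::array<std::ostringstream, NSECTIONS> divert;

    // Each included .gh module deposits text into whichever sections it
    // needs, regardless of where in the top-level .g file it was included.
    void add(Section s, const std::string &text) { divert[s] << text << '\n'; }

    // At EOF, concatenate the diversions into a single .y source file,
    // inserting the proper delimiters between the sections.
    std::string emit() const {
        std::ostringstream y;
        y << "%{\n" << divert[PROLOGUE].str() << "%}\n"
          << "%union {\n" << divert[UNION_ITEMS].str() << "}\n"
          << divert[DECLARATIONS].str()
          << "%%\n" << divert[RULES].str()
          << "%%\n" << divert[EPILOGUE].str();
        return y.str();
    }
};
```

The point is that a module's text lands in the right *section*, not at the
textual point of inclusion, which is exactly what a character-stream
%include cannot do.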

This is just the "tip of the iceberg," however.  The next thing you
encounter when partitioning grammars into modules like this is token
numbers.  Each top-level grammar "sees" a distinct set of terminal
symbols -- and Bison assigns a unique token number to each of these
terminals.  But if Bison picks token number 163 for the LEFT_PAREN
token when processing top-level grammar "foo.y", you can be pretty
certain that when it processes top-level grammar "bar.y" it is going
to pick something *other* than token number 163 for LEFT_PAREN.  This
means that you are going to have to do one of the following things in
order to get properly working lexical analyzers for these grammars:

  1. Write a separate, custom lexical analyzer (floor to ceiling)
     for each top-level grammar that you process.  (Yikes!)

  2. Write some sort of "templated" lexical analyzer that is
     shared by all top-level grammars, but that you must re-compile
     separately for each top-level grammar.  For example, it could
     #include the token-definitions header that Bison generated for the
     particular top-level grammar this lexical analyzer will be
     used with, and do #ifdef LEFT_PAREN ... #endif type stuff to
     determine whether this grammar needs to recognize "(" tokens
     or not.  Still pretty gnarly, and it makes the lexical analyzer
     a bottleneck for everyone who is writing new grammar (e.g.,
     adding new keywords, etc.)
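     A tiny sketch of this option (names hypothetical; in a real build the
     #define below would come from the generated per-grammar header, e.g.
     a "foo.tab.h", rather than being hard-coded):

```cpp
// Simulates including the token-definitions header that Bison generated
// for one particular top-level grammar.  The value 163 is made up.
#define LEFT_PAREN 163

// The shared, "templated" lexer body: it recognizes "(" only if the
// grammar it was compiled against defines a LEFT_PAREN token.
int lex_punct(char c) {
#ifdef LEFT_PAREN
    if (c == '(')
        return LEFT_PAREN;  // this grammar uses "(" tokens
#endif
    return -1;              // not a token in this grammar
}
```

     Every new keyword still forces a recompile of this lexer for every
     grammar, which is the bottleneck described above.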

  3. Write a single "chameleon" lexical analyzer that can dynamically
     configure itself to the particular grammar it is being asked to
     lex.  Using a custom Bison parser skeleton, for example, one can arrange
     to have the following code executed early in the initialization
     phases of the generated yyparse() function:

        d_lexer -> initSpecialTokens (YYNTOKENS, yytname);

     This function slogs through the yytname[] array, deducing the
     token number of each terminal symbol, and what "lexical syntax"
     is required for each such token number.  Here is a notional
     yytname[] that I extracted (and cleansed) from one of our grammars:

        #define YYNTOKENS       43

        static const char *const yytname[] =
        {
          "$end", "error", "$undefined", "IDENT", "STRING", "CCONST", "ICONST",
          "FCONST", "\",\"", "\"or\"", "\"and\"", "\"eq\"", "\"ne\"", "\"<\"",
          "\"<=\"", "\">\"", "\">=\"", "\"+\"", "\"-\"", "\"*\"", "\"/\"",
          "\"(\"", "\")\"", "\"not\"", "\"true\"", "\"false\"", "\"::\"",
          "\"]\"", "\":\"", "\"{\"", "\"}\"", "\"keyword1\"", "\"keyword2\"",
          "\"keyword3\"", "\"begin!\"", "\"end!\"", "\";\"", "\"=\"",
          "\"keyword4\"", "\"keyword5\"", "\"keyword6\"", "$accept",
          /* Other entries for non-terminal symbols omitted. */
        };

     Our "chameleon" lexer internally knows about whitespace, C and C++
     style comments, and the following "primitive" token types:

        IDENT           - C or C++ style identifiers.
        STRING          - Double-quoted strings.
        CCONST          - Single-quoted strings (character constants).
        ICONST          - Integer constants.
        FCONST          - Floating-point constants.

     All other tokens in yytname[] must be double-quoted strings that split
     cleanly into one of the following two buckets:

     a. Keywords.  These are first recognized as an IDENT, but then looked up
        in a hash table of "exceptions".  (Our implementation allows you to make
        the keyword "reserved" if the yytname[] entry ends with a ! (e.g.,
        "\"begin!\"" and "\"end!\"" in the above example).  In this case the
        keyword's token number is always returned.  If the keyword is NOT
        reserved, our implementation checks to see if the keyword's token
        number is "shiftable" in the current parsing context.  If so, it
        returns the keyword's token number, otherwise it returns IDENT's
        token number.)

     b. Punctuators.  These consist of one or more entirely non-alphabetic,
        non-digit characters.  (Things containing /*, */ and // are illegal.)

     After analyzing each token entry of yytname[], our chameleon lexer sets
     up all of the hash tables (for keywords) and tries (for punctuators)
     necessary to quickly recognize each of these tokens as they appear in
     the input file, and return the proper token number that the grammar
     being parsed expects.

     This gives us the luxury of a single lexical analyzer (fairly complex, but
     write it once and be done with it), and then the ability to write grammar
     fragments such as the following:

        /* In the "distribution.gh" grammar header file... */

        #include "distribution_pnode.h"  /* Defines struct distribution_pnode */



        struct distribution_pnode *     y_distribution;


        %type <y_distribution>    constant_distribution
        %type <y_distribution>    distribution
        %type <y_distribution>    exponential_distribution
        %type <y_distribution>    normal_distribution
        %type <y_distribution>    uniform_distribution

        distribution:
                  constant_distribution
                | uniform_distribution
                | normal_distribution
                | exponential_distribution
                ;

        constant_distribution:  "constant" "(" fnum ")"
                        { $$ = constant_distribution_create (@2, $3); };
        uniform_distribution:   "uniform" "(" fnum "," fnum ")"
                        { $$ = uniform_distribution_create (@2, $3, $5); };
        normal_distribution:    "normal" "(" fnum "," fnum ")"
                        { $$ = normal_distribution_create (@2, $3, $5); };
        exponential_distribution: "exponential" "(" fnum "," fnum ")"
                        { $$ = exponential_distribution_create (@2, $3, $5); };

        /* In the "fnum.gh" grammar header file... */

        double  y_double;

        %type <y_double>  fnum


        fnum:
                  FCONST        { $$ = $1; }
                | "+" FCONST    { $$ = + $2; }
                | "-" FCONST    { $$ = - $2; }
                | ICONST        { $$ = $1; }
                | inum          { $$ = $1; }
                ;

        /* In the "inum.gh" grammar header file... */

        long    y_long;

        %type <y_long>            inum

        inum:
                  ICONST        { $$ = $1; }
                | "+" ICONST    { $$ = + $2; }
                | "-" ICONST    { $$ = - $2; }
                ;
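Incidentally, the "shiftable in the current parsing context" test described
earlier (for non-reserved keywords) can be performed directly against
Bison's generated tables.  A sketch with hand-made toy tables (all of the
numbers below are made up, and a real implementation must also account for
states whose behavior is a default reduction):

```cpp
// Bison's standard table layout: yypact[] gives a per-state base offset
// into the action tables, and yycheck[] guards against offset collisions.
#define YYPACT_NINF (-5)   // "no explicit actions" marker (value varies)
#define YYLAST 5

static const int yypact[]  = { 0, YYPACT_NINF };   // per-state base offsets
static const int yycheck[] = { 3, 9, 3, 3, 3, 3 }; // guard row

// True iff `token` has an explicit table entry in parser state `state`,
// i.e. the keyword's token number could be returned here.
static bool token_acceptable(int state, int token) {
    int base = yypact[state];
    if (base == YYPACT_NINF)
        return false;       // this state has only default actions
    int n = base + token;
    return n >= 0 && n <= YYLAST && yycheck[n] == token;
}
```

If the keyword's token number fails this test, the lexer falls back to
returning IDENT's token number instead.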

There are some other things that you discover when you go down this road.
Bison starts generating lots of error messages.  For example, when your
last grammatical reference to "distribution" gets deleted from your top-
level grammar, Bison complains about distribution, constant_distribution,
uniform_distribution, normal_distribution, exponential_distribution --
and perhaps even fnum and inum -- as rules/non-terminals that are defined
but never used by the grammar.  We implemented the following:

        %begin-library "distribution"

                /* ...rules and declarations... */

        %end-library

All of the rules and declarations occurring between these two directives
are considered to be part of a library (or module) with the given name.
It is acceptable to use only a portion of the rules/symbols appearing in
a library, and (our modified version of) Bison does not report any of
them as being "useless".  However, if *all* of them are unused, then it
instead reports that the corresponding "library" is useless.
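The reporting rule is small enough to state as a function.  A C++ sketch
(hypothetical data model; the real change lives inside Bison's
useless-symbol analysis):

```cpp
#include <map>
#include <string>
#include <vector>

struct Symbol {
    std::string name;
    std::string library;  // empty if not part of any library
    bool used;
};

// Symbols inside a %begin-library/%end-library block are never reported
// individually; a library is reported only if *all* of its symbols are
// unused.  Symbols outside any library are reported as usual.
static std::vector<std::string> useless_report(const std::vector<Symbol> &syms) {
    std::vector<std::string> report;
    std::map<std::string, bool> any_used;  // per-library "some symbol used"
    for (const auto &s : syms)
        if (!s.library.empty())
            any_used[s.library] |= s.used;
    for (const auto &s : syms)
        if (!s.used && s.library.empty())
            report.push_back("useless symbol: " + s.name);
    for (const auto &entry : any_used)
        if (!entry.second)
            report.push_back("useless library: " + entry.first);
    return report;
}
```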

I regret that I cannot share our actual implementation of all this with
you because the code is "owned" by the US Government, and I do not have
the authority to release any of it.  None of it is rocket science,
however.  Once you understand the issues that need to be addressed, it
is pretty clear how to do so.

We implemented the basic infrastructure more than 20 years ago, and did
some significant additional renovations about 5 years ago:

        - The %begin-library and %end-library directives.

        - Make Include_Grammar automatically implement "include guard"
          functionality that prevents multiple inclusion.

        - Modify M4 (which still implements our "grammar file inclusion"
          capability) to automatically create "make" dependencies.
          (The M4 maintainers have *still* not included these
          enhancements in any released version of M4.)

These allowed us to update our "make" system to automatically generate
dependencies for grammar files, and we finally obtained completely "clean"
builds of our software (no more useless rule/symbol warnings from Bison
because of partially used include files).

Although I cannot contribute this code, I hope that the knowledge and
experience that I have gained making all of this machinery work (well)
can help inform the Bison developer community about what works, what
does not, and why.

David Warme

On 11/11/2018 12:05 PM, Akim Demaille wrote:

> On 10 Nov 2018, at 15:08, Egor Pugin <address@hidden> wrote:
>
>>> Let '#include' die, welcome 'import'.
>>
>> I see.
>> But couldn't include and import be orthogonal in bison?
>
> I don't want to have two features to maintain.
>
>> It's nice to have common prologues/epilogues, and also parts of the
>> grammar, as you said.
>> Maybe %include is a nice-to-have feature ahead of possible future
>> imports?
>
> I confess that I'm kind of worried about %import: the infrastructure
> of Bison is far from being ready.  We don't really have an AST, for
> instance, so parsing several different files will not be convenient.
> And I'm not particularly impatient to write an AST in C.  I wish
> Bison were in C++…