
Re: Initial %include implementation

From: Akim Demaille
Subject: Re: Initial %include implementation
Date: Tue, 20 Nov 2018 08:37:58 +0100

Hi David,

Thanks a lot for the very detailed answer!  There is plenty of
experience in your work to benefit from.

> On 18 Nov 2018 at 00:52, David M. Warme <address@hidden> wrote:
> The %include feature (as coded in this patch) is not terribly useful.
> Here is why:
> A proper bison source file is partitioned into various sections,[…]

What you described is really close to what I had in mind.  Note,
however, that Bison was already modified to make things like this
easier.  For instance, it is now possible to have several %union
clauses; they simply get merged.

Actually, I dislike %union, which is why I introduced %define
api.value.type union:

    %define api.value.type union
    %token <int> INT "integer"
    %token <char *> STRING "string"

    /* In yylex().  */
    yylval.INT = 42; return INT;
    yylval.STRING = "42"; return STRING;

Did you consider using this instead of %union?  Or do you see some
value in %union that api.value.type=union does not offer?
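For readers comparing the two, the traditional %union spelling of the
same example would look like this (the member names ival and sval are
mine, purely illustrative):

```yacc
/* With %union, the user must invent and maintain member names
   (ival, sval) -- unlike api.value.type union, where the token
   name itself serves as the member name. */
%union { int ival; char *sval; }
%token <ival> INT "integer"
%token <sval> STRING "string"

/* In yylex(). */
yylval.ival = 42; return INT;
yylval.sval = "42"; return STRING;
```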

Also, I guess you kind of forked Bison and don’t track its
evolution, do you?  I remember that you were among those who
needed some of the tables that we no longer generate; did these
tables play a role in this modular-grammars support?  Or were they
for completely different issues?

> In reality, you probably want each include file to have structure
> similar to the top-level grammar file, with 5 individual sections of
> its own.

Right.  That’s where the current implementation of Bison shows
its limitations.  The parser generates a structure that is somewhat
like an AST, but it’s already tailored for the phases that follow.
For instance, I’m worried about the symbol numbers, etc.

> I actually implemented a facility to do exactly this more than 20 years
> ago -- but I implemented it as a set of GNU M4 macros, not internally
> to Bison.

So you really developed a preprocessor for Bison, right?
You didn’t even have to hack Bison, or did you?

> This is just the "tip of the iceberg," however.  The next thing you
> encounter when partitioning grammars into modules like this is token
> numbers.

Yep.  I’m also worried about associativity/precedence directives,
and about unexpected fusions between non-terminals that were meant
to be different but just happen to have the same name, etc.

Do you have any form of scoping in your modules?  Do you do anything
special for precedence for instance?

> Each top-level grammar "sees" a distinct set of terminal
> symbols -- and Bison assigns a unique token number to each of these
> terminals.[…]

How are your scanners written?  By hand?  Using some generator?
It seems to me that if Bison were to support the scanning phase
itself, the problem would be solved, wouldn’t it?  Or even if it
were able to emit a Lex file from an augmented grammar file with
regexes.  Of course, it would not be as powerful as a hand-tuned
use of Lex, but maybe you didn’t need anything fancy in the
scanner?

>  1. Write a separate, custom lexical analyzer (floor to ceiling)
>     for each top-level grammar that you process.  (Yikes!)

The "Yikes" is about the source code and the effort demanded of
the developers, not about code bloat in the object code, right?

>  2. Write some sort of "templated" lexical analyzer that is
>     shared by all top-level grammars, but that you must re-compile
>     separately for each top-level grammar.  For example, it could
>     #include the "" file that is associated with the
>     particular top-level grammar this lexical analyzer will be
>     used with, and do #ifdef LEFT_PAREN ... #endif type stuff to
>     determine whether this grammar needs to recognize "(" tokens
>     or not.  Still pretty gnarly, and it makes the lexical analyzer
>     a bottleneck for everyone who is writing new grammar (e.g.,
>     adding new keywords, etc.)
>  3. Write a single "chameleon" lexical analyzer that can dynamically
>     configure itself to the particular grammar it is being asked to
>     lex.  Using a custom Bison parser skeleton, for example, one can arrange
>     to have the following code executed early in the initialization
>     phases of the generated yyparse() function:
>       d_lexer -> initSpecialTokens (YYNTOKENS, yytname);
>     This function slogs through the yytname[] array, deducing the
>     token number of each terminal symbol, and what "lexical syntax"
>     is required for each such token number.  Here is a notional
>     yytname[] that I extracted (and cleansed) from one of our grammars:
>       #define YYNTOKENS       43
>       static const char *const yytname[] =
>       {
>         "$end", "error", "$undefined", "IDENT", "STRING", "CCONST", "ICONST",
>         "FCONST", "\",\"", "\"or\"", "\"and\"", "\"eq\"", "\"ne\"", "\"<\"",
>         "\"<=\"", "\">\"", "\">=\"", "\"+\"", "\"-\"", "\"*\"", "\"/\"", 
> "\"%\"",
>         "\"(\"", "\")\"", "\"not\"", "\"true\"", "\"false\"", "\"::\"", 
> "\"[\"",
>         "\"]\"", "\":\"", "\"{\"", "\"}\"", "\"keyword1\"", "\"keyword2\"",
>         "\"keyword3\"", "\"begin!\"", "\"end!\"", "\";\"", "\"=\"",
>         "\"keyword4\"", "\"keyword5\"", "\"keyword6\"", "$accept",
>         /* Other entries for non-terminal symbols omitted. */
>       };

You appear to depend a lot on the syntax of yytname.
*Please* have a look at this proposal, and comment on it!  I _need_
feedback on this.
>     Our "chameleon" lexer internally knows about whitespace, C and C++
>     style comments, and the following "primitive" token types:
>       IDENT           - C or C++ style identifiers.
>       STRING          - Double-quoted strings.
>       CCONST          - Single-quoted strings (character constants).
>       ICONST          - Integer constants
>       FCONST          - Floating-point constants
>     All other tokens in yytname[] must be double-quoted strings that split
>     cleanly into one of the following two buckets:
>     a. Keywords.  These are first recognized as an IDENT, but then looked up
>       in a hash table of "exceptions".  (Our implementation allows you to make
>       the keyword be "reserved" if the yytname[] entry ends with a !, (e.g.,
>       "\"begin!\"" and "\"end!\"" in the above example).  In this case the
>       keyword's token number is always returned.  If the keyword is NOT
>       reserved, our implementation checks to see if the keyword's token
>       number is "shiftable" in the current parsing context.  If so, it
>       returns the keyword's token number, otherwise it returns IDENT's
>       token number.)

That’s interesting!  So they are not really reserved words; they behave
like, say, ‘override’ in C++.
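The keyword/IDENT fallback described above could be sketched as follows.
The token numbers follow the indices in the sample yytname[] (IDENT = 3,
"keyword1" = 33, "begin!" = 36); the keyword table, lex_identifier(),
and yy_shiftable() are stand-ins of my own, not real Bison API or
David's code:

```c
#include <string.h>

enum { TOK_IDENT = 3, TOK_KEYWORD1 = 33, TOK_BEGIN = 36 };

struct keyword { const char *text; int token; int reserved; };

/* Toy "hash table": a linear scan is enough for a sketch. */
static const struct keyword keywords[] = {
  { "begin",    TOK_BEGIN,    1 },  /* "begin!": always its own token */
  { "keyword1", TOK_KEYWORD1, 0 },  /* contextual: may fall back to IDENT */
};

/* Stand-in for the real check against the parser's action tables:
   pretend "keyword1" can only be shifted in state 1. */
static int
yy_shiftable (int state, int token)
{
  return token != TOK_KEYWORD1 || state == 1;
}

static int
lex_identifier (const char *text, int state)
{
  for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
    if (strcmp (text, keywords[i].text) == 0)
      {
        if (keywords[i].reserved)
          return keywords[i].token;
        /* Non-reserved: return the keyword token only if the parser
           could shift it right now; otherwise treat it as IDENT. */
        return yy_shiftable (state, keywords[i].token)
               ? keywords[i].token : TOK_IDENT;
      }
  return TOK_IDENT;   /* not a keyword at all: plain identifier */
}
```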

>     b. Punctuators.  These consist of one or more entirely non-alphabetic,
>       non-digit characters.  (Things containing /*, */ and // are illegal.)
>     After analyzing each token entry of yytname[], our chameleon lexer sets
>     up all of the hash tables (for keywords), and trie’s (for punctuators)

A trie???  How many punctuators do you have for a trie to be relevant?
Or maybe it’s for an easy implementation of the longest-match?
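Whatever the data structure, the longest-match rule itself is simple.  A
trie resolves it in one pass over the input; the linear scan below is
just the most compact way to show it (the punctuator set and token
numbers are the indices from the sample yytname[], and match_punct is a
name of my own):

```c
#include <string.h>

struct punct { const char *text; int token; };

/* Token numbers are the indices of these entries in the sample
   yytname[]: "<" = 13, "<=" = 14, "(" = 22, "::" = 27, ":" = 30. */
static const struct punct puncts[] = {
  { "<", 13 }, { "<=", 14 }, { "(", 22 }, { "::", 27 }, { ":", 30 },
};

/* Returns the matched length (0 if none); *token gets the token number
   of the longest punctuator that prefixes the input. */
static size_t
match_punct (const char *input, int *token)
{
  size_t best = 0;
  for (size_t i = 0; i < sizeof puncts / sizeof puncts[0]; i++)
    {
      size_t n = strlen (puncts[i].text);
      if (n > best && strncmp (input, puncts[i].text, n) == 0)
        {
          best = n;
          *token = puncts[i].token;
        }
    }
  return best;
}
```

So "<=" wins over "<" on the input "<=x", which is exactly the
longest-match behavior a trie would give in a single traversal.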

>     This gives us the luxury of a single lexical analyzer (fairly complex, but
>     write it once and be done with it), and then the ability to write grammar
>     fragments such as the following:
> […]

This is nice.

>       /* In the "" grammar header file... */
>       Grammar_Declarations
>       %type <y_long>          inum
>       inum:
>                 ICONST        { $$ = $1; }
>               | "+" ICONST    { $$ = + $2; }
>               | "-" ICONST    { $$ = - $2; }
>               ;

What do you do about conflicts?

> There are some other things that you discover when you go down this road.
> Bison starts generating lots of error messages.  For example, when your
> last grammatical reference to "distribution" gets deleted from your top-
> level grammar, Bison complains about distribution, constant_distribution,
> uniform_distribution, normal_distribution, exponential_distribution --
> and perhaps even fnum and inum -- as rules/non-terminals that are defined
> but never used by the grammar.

If the -W categories are not fine-grained enough, we should improve them.

>  We implemented the following:
>       %begin-library ""
>       ...
>       %end-library
> All of the rules and declarations occurring between these two directives
> are considered to be part of a library (or module) named "".
> It is acceptable to use only a portion of the rules/symbols appearing in
> a library, and (our modified version of) Bison does not report any of
> them as being "useless".  However, if *all* of them are unused, then it
> instead reports that the corresponding "library" is useless.


> I regret that I cannot share our actual implementation of all this with
> you because the code is "owned" by the US Government, and I do not have
> the authority to release any of it.  None of it is rocket science,
> however.  Once you understand the issues that need to be addressed, it
> is pretty clear how to do so.

Well, doing this in Bison itself is certainly not rocket science
either, but the design space is huge, and the current implementation
is too low-level.

> We implemented the basic infrastructure more than 20 years ago, and did
> some significant additional renovations about 5 years ago:

What version of Bison is your basis?

>       - The %begin-library and %end-library directives.
>       - Make Include_Grammar automatically implement "include guard"
>         functionality that prevents multiple inclusion.
>       - Modify M4 (that still implements our "grammar file inclusion"
>         capability) with the ability to automatically create "make"
>         dependencies.  (The M4 maintainers have *still* not included
>         these enhancements into any released version of M4.)

That’s a pity.  This is something that Bison would benefit from too.
And Autoconf, and many others.

> Although I cannot contribute this code, I hope that the knowledge and
> experience that I have gained making all of this machinery work (well)
> can help inform the Bison developer community about what works, what
> does not, and why.

Thanks a lot for taking the time to craft a precise description of
your work!

I have one last question: why Bison?  For instance, ANTLR would
probably also have made sense, and it features things that Bison
does not.

