[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[PATCH 00/17] RFC: multiple start symbols

From: Akim Demaille
Subject: [PATCH 00/17] RFC: multiple start symbols
Date: Sun, 20 Sep 2020 10:37:32 +0200


This is something I wanted to work on for quite a long time: allowing to
define several entry points in the grammar.  What follows is far from
being complete:
- it works only with yacc.c,
- I have not yet looked at special cases such as push parsers,
- there is no documentation, 
- nothing in the test suite (but examples/c/lexcalc does demonstrate it).

However, before proceeding I would like to get feedback.

Start symbols can be either nonterminal, or tokens (which might be useful
for people willing to write test cases).

Start symbols may have be valueless, or have a semantic value.  In the
latter case, "api.value.type union" _must_ be used, so that Bison knows
exactly the type of the semantic value, which allows to provide a dedicated
parsing function returning the symbol's value.

Let's consider lexcalc as an example:

    %define api.value.type union
    %token <int> NUM "number"
    %type <int> exp expression line
    %start input expression NUM

produces three parsing functions.

Since "input" is valueless, its generated parsing function is:

    // Return type when parsing one input.
    typedef struct
      int yystatus;
      int yynerrs;
    } yyparse_input_t;
    // Parse one input.
    yyparse_input_t yyparse_input (void);

Having new functions is a chance to get the signature right: the traditional
one really looks like a hack on top of a design made for global variables:
capturing the semantic value of the parse, getting the number of errors,
etc. was clumsy.  Hence these new structs, returned by the specific parsing

In the case of symbols with values, one gets:

    // Return type when parsing one expression.
    typedef struct
      int yyvalue;
      int yystatus;
      int yynerrs;
    } yyparse_expression_t;
    // Parse one expression.
    yyparse_expression_t yyparse_expression (void);

If you're curious, multiple-start-symbols support uses an old trick:
inserting a "switching token" for each start symbol, and have the "real"
initial symbol branch on the corresponding rule.  In the case of lexcalc,
the report shows this:

    0 $accept: YY_PARSE_NUM "number" $end
    1        | YY_PARSE_expression expression $end
    2        | YY_PARSE_input input $end

The parsing function just passes the switching token as initial token to the
"real" parsing function (yyparse_yyimpl):

    yyparse_expression (void)
      yyparse_expression_t yyres;
      yyparse_yyimpl_t yyimpl;
      yyres.yystatus = yyparse_yyimpl (YY_PARSE_expression, &yyimpl);
      yyres.yyvalue = yyimpl.yyvalue.TOK_expression;
      yyres.yynerrs = yyimpl.yynerrs;
      return yyres;

Having several initial rules changes the parser termination: we used to
check if we reached _the_ final state, but now there are several.  Rather
than making the condition more complex and costly (it was run each time we
entered a new state), I am now recognizing the reduction of the start rules
themselves (so this is checked only when reducing, in which/case we already
have a switch/case statement for the action).

So the generated actions look like this:

    switch (yyn)
    case 1: /* $accept: YY_PARSE_NUM "number" $end  */
    { (yyimpl->yyvalue.TOK_NUM) = (yyvsp[-1].TOK_NUM); YYACCEPT; }
    case 2: /* $accept: YY_PARSE_expression expression $end  */
    { (yyimpl->yyvalue.TOK_expression) = (yyvsp[-1].TOK_expression); YYACCEPT; }
    case 3: /* $accept: YY_PARSE_input input $end  */
    { YYACCEPT; }

The branch is available too.

Comments most welcome.

Akim Demaille (17):
  gram: more debugging information
  reader: get ready to create several initial rules
  parser: expose a list of symbols
  multistart: turn start symbols into rules on $accept
  multistart: adjust computation of initial core and adjust reports
  multistart: also check the HTML report
  multistart: pass the list of start symbols to the backend
  multistart: equip yacc.c
  multistart: toy with it in lexcalc
  todo: more
  multistart: adjust reader checks for generated rules
  multistart: use b4_accept instead of action post-processing
  multistart: allow tokens as start symbols
  yacc.c: also count calls to YYERROR in yynerrs
  multistart: also give access to yynerrs

 TODO                            |   38 +
 data/                  |    8 +-
 data/skeletons/bison.m4         |   14 +
 data/skeletons/yacc.c           |  101 +-
 examples/c/lexcalc/lexcalc.test |   30 +-
 examples/c/lexcalc/parse.y      |   54 +-
 examples/c/lexcalc/scan.l       |    4 +-
 src/gram.c                      |   19 +-
 src/graphviz.c                  |   14 +-
 src/lr0.c                       |    7 +-
 src/main.c                      |    1 +
 src/output.c                    |   22 +
 src/parse-gram.c                |  576 +++++------
 src/parse-gram.h                |   14 +-
 src/parse-gram.y                |   27 +-
 src/print-xml.c                 |    9 +-
 src/print.c                     |    7 +-
 src/reader.c                    |  234 +++--
 src/reader.h                    |    9 +-
 src/reduce.c                    |   19 +-
 src/scan-code.h                 |    7 +
 src/scan-code.l                 |   18 +-
 src/symtab.c                    |   22 +-
 src/symtab.h                    |    5 -
 tests/                 |   66 +-
 tests/                 | 1597 +++++++++++++++++++++++++++++++
 26 files changed, 2469 insertions(+), 453 deletions(-)


reply via email to

[Prev in Thread] Current Thread [Next in Thread]