Re: RFC: custom error messages

bison-patches

From:	Akim Demaille
Subject:	Re: RFC: custom error messages
Date:	Wed, 8 Jan 2020 07:37:29 +0100

Hi Rici!

Thanks a lot for your detailed answer.

> On 6 Jan 2020, at 19:23, Rici Lake <address@hidden> wrote:
> 
> Hi, Akim.
> 
> Sorry for the slow response,

No problem!  Anyway, I'm pretty slow at responding too currently.  I usually
spend some time on Bison when I'm on public transportation, but there
are strikes right now that make it quite difficult to have time to spend
on Bison :)


> For as long as I have been aware, the `"literal"` syntax has been described
> in the bison manual as a "literal string token"
> <http://dinosaur.compilertools.net/bison/bison_6.html#SEC40> and the
> motivating example then and now is, in fact, a literal string (`"<="`).
> Like character tokens, literal string tokens provide a way to make the
> grammar reflect the actual tokens used in the language being parsed. It is
> similar to the notation used in ISO EBNF (although ISO EBNF does not
> distinguish between single- and double-quoted forms).
> 
> Unfortunately, no mechanism was ever proposed which allowed for the
> automated extraction of these keyword literals to be used in an
> automatically-generated scanner description. But various ad hoc mechanisms
> have been developed for particular projects, and some of them are still
> extant.

It depends on what you mean by "extraction".  In fact, contrary to what I
believed, the piece of code to "extract" the literal strings has been
documented since forever (Bison 1.25 is the oldest version of Bison I can
find).

http://dinosaur.compilertools.net/bison/bison_7.html#SEC62


> At some point in the development of bison, it seems to have occurred to the
> team that literal string tokens could also be used to give human-readable
> and translatable names to non-keyword tokens (`"identifier"`).

On this, you are mistaken: Bison 1.25 implements YYERROR_VERBOSE and
uses the string aliases to improve the error messages.  For instance
with

> %token <ival> NUM "number"
> %token '+' "+"
> %type  <ival> expr term fact

it produces

> #if YYDEBUG != 0 || defined (YYERROR_VERBOSE)
> 
> static const char * const yytname[] = {   
> "$","error","$undefined.","\"number\"",
> "\"+\"","'\\n'","'-'","'*'","'/'","'('","')'","line","expr","term","fact", 
> NULL
> };
> #endif

and

> #ifdef YYERROR_VERBOSE
>       yyn = yypact[yystate];
> 
>       if (yyn > YYFLAG && yyn < YYLAST)
>         {
>           int size = 0;
>           char *msg;
>           int x, count;
> 
>           count = 0;
>           /* Start X at -yyn if nec to avoid negative indexes in yycheck.  */
>           for (x = (yyn < 0 ? -yyn : 0);
>                x < (sizeof(yytname) / sizeof(char *)); x++)
>             if (yycheck[x + yyn] == x)
>               size += strlen(yytname[x]) + 15, count++;
>           msg = (char *) malloc(size + 15);
>           if (msg != 0)
>             {
>               strcpy(msg, "parse error");
> 
>               if (count < 5)
>                 {
>                   count = 0;
>                   for (x = (yyn < 0 ? -yyn : 0);
>                        x < (sizeof(yytname) / sizeof(char *)); x++)
>                     if (yycheck[x + yyn] == x)
>                       {
>     *****               strcat(msg, count == 0 ? ", expecting `" : " or `");
>     *****               strcat(msg, yytname[x]);
>                         strcat(msg, "'");
>                         count++;
>                       }
>                 }
>               yyerror(msg);
>               free(msg);
>             }
>           else
>             yyerror ("parse error; also virtual memory exceeded");
>         }
>       else
> #endif /* YYERROR_VERBOSE */

The oldest revisions in git (which is clearly not the real origin of
Bison, it was entered into RCS after having lived outside of any VCS)
also have both features at the same time.  So AFAICT this duality is
there from the inception.


> Now, it appears that the secondary usage has become the primary usage.

Actually, the problem here is rather: which of the two is really properly
covered?  It is clear to me that the definition of tokens is insufficient.
And you agree with that:

> As a final note, once automated scanner generation is on the table, it
> could also become useful to provide some way of telling the scanner
> generator what scanner pattern expressions should be used for non-keyword
> tokens.

I fully subscribe to this view, but string literals are definitely not
the way to go.  So a few months ago I realized that what we really need
to do is to merge Joel E. Denny's PhD into Bison
(https://tigerprints.clemson.edu/all_dissertations/519/).

_That's_ the real way forward.  That's Bison 4.


> I don't really have a proposal here. Most of the automated scanner
> generators I've written assume that a quoted string is a literal unless it
> has at least one whitespace character, but that's not very satisfactory.

Actually I have been considering adding a CSV output with all the tokens
and their various traits.  That should help people who want to extract
the tokens.
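
To make the whitespace heuristic quoted above concrete, here is a minimal
sketch in C.  The function name alias_looks_literal is made up for this
illustration and is not part of Bison; it assumes the alias has already been
stripped of its surrounding quotes.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not a Bison API.  Guess whether the string
   alias ALIAS (given without its surrounding quotes) denotes a
   literal keyword or operator: it does unless it contains at least
   one whitespace character.  */
static bool
alias_looks_literal (const char *alias)
{
  assert (alias != NULL);
  for (const char *p = alias; *p; p++)
    if (isspace ((unsigned char) *p))
      return false;
  return true;
}
```

With this rule "<=" is classified as a literal and "end of file" as a
description, but "number" and "identifier" are also classified as literals,
which is presumably why the heuristic is called not very satisfactory.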


> That was, I think, a very long non-response to your RFC. But I hope it was
> in some way useful.

It is definitely useful!  But I can't improve both uses at the same time.
I'm targeting error message improvements for Bison 3.6, and improving
the handling of the token definitions will be for Bison 4, expected to
be published some day in the current decade :)


> Let me add one more consideration, which is possibly
> more relevant, about whether quotes should or should not be included in
> name tables.
> 
> I think that the uses of the (or a) token name table which appear in the
> RFC assume that the token names will be used at runtime. In such use cases,
> the quotes are certainly an inconvenience, and removing them is an
> unnecessarily annoying project.

Yep.
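
To illustrate the inconvenience, here is a rough sketch of what a user has to
write to get a plain name out of a yytname-style entry at runtime.  It is a
simplified illustration, not the yytnamerr function of the skeleton; the name
unquote_token_name and its interface are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a Bison API.  Copy NAME into OUT (of
   capacity CAP) without its surrounding double quotes, undoing
   backslash escapes such as \" and \\.  Names that are not
   double-quoted (e.g. 'x' or expr) are copied unchanged.  Return
   the length written, or -1 if OUT is too small.  */
static int
unquote_token_name (const char *name, char *out, size_t cap)
{
  assert (name != NULL && out != NULL);
  size_t len = strlen (name);
  size_t n = 0;

  if (cap == 0)
    return -1;

  /* Not a double-quoted alias: plain copy.  */
  if (len < 2 || name[0] != '"' || name[len - 1] != '"')
    {
      if (len + 1 > cap)
        return -1;
      memcpy (out, name, len + 1);
      return (int) len;
    }

  /* Skip the opening and closing quotes.  */
  for (size_t i = 1; i + 1 < len; i++)
    {
      char c = name[i];
      /* A backslash escapes the next character inside the quotes.  */
      if (c == '\\' && i + 2 < len)
        c = name[++i];
      if (n + 1 >= cap)
        return -1;
      out[n++] = c;
    }
  out[n] = '\0';
  return (int) n;
}
```

For the table above, the entry "\"number\"" yields number, while "'-'" is
left as it is.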

> But another use case for the/a token name table is at code-generation time

I guess you now agree that string literals are _not_ satisfying that need,
so you see why I want to get rid of the quotes in the new values for
%define parse.error.  But "%define parse.error verbose" will still generate
yytname as it is.

> (whether the code being generated is a scanner or a literal error message,
> or something else). In this use case, it is extremely useful to have the
> quoted form, which is a valid C string literal whose value will be the
> token name when compiled. The quoting functions in bison cheerfully handle
> a number of corner cases which are awkward to implement, and which are
> often not even thought about (such as tokens which happen to include
> trigraphs).

Do you really still depend on trigraphs???


> Finally, thanks a lot for your many valuable contributions, and my best
> wishes for the new year.

Many thanks for your insight, and best wishes for the new decade ;)

Cheers!
