bison-patches

Re: RFC: custom error messages


From: Akim Demaille
Subject: Re: RFC: custom error messages
Date: Fri, 10 Jan 2020 07:23:43 +0100

Hi Christian!

> On 9 Jan 2020, at 14:50, Christian Schoenebeck <address@hidden> wrote:
> 
> On Sunday, January 5, 2020 17:52:43 CET Akim Demaille wrote:
> 
>>> Why not making that a general-purpose function instead that users could
>>> call at any time with the current parser state:
>>> 
>>> // returns NULL terminated list
>>> const enum yysymbolid* yynextsymbols(const yystate* currentParserState);
>> 
>> I don't want to have to deal with allocating space.  Your proposal
>> needs to allocate space.  Hence the clumsy interface I provided :)
> 
> Well, allocation is just a minor API detail that could easily be addressed.

I wouldn't call memory management in yacc.c "easy": a lot of effort
goes into allocating on the call stack and avoiding malloc.

> But if your suggested function is limited to error cases only (i.e. not
> working in any parser state), that would IMO indeed be a show stopper
> for features like auto completion.

I kept thinking about what you suggested, and I agree we should work
on this too.  In fact, I have already started to explore this path
in my PR
(https://github.com/akimd/bison/pull/16/commits/1a4dfeff064bea6552f130859a81fcccb657c28d).


>> That might be doable with api.push.  I don't see that coming for
>> the pull interface.
> 
> Right, I forgot that the push parser option exists. [...]

Playing tricks with the parser's overall state is really the point of
push parsers; that's where new features such as duplicate_state would
land, not in the default pull parsers.
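
For readers unfamiliar with the push interface: with %define api.push-pull push, the generated parser exposes its state as an explicit object (a yypstate), and the application feeds it tokens one at a time.  A minimal sketch, assuming a pure yylex and the names used by the yacc.c skeleton (yypstate_new, yypush_parse, YYPUSH_MORE, yypstate_delete); the grammar itself is elided:

```yacc
%define api.push-pull push
%define api.pure full
%%
/* ... grammar rules ... */
%%
/* Caller-side loop: the parser state lives in an explicit object
   between calls, which is what makes tricks such as duplicating or
   inspecting that state conceivable at all. */
void run (void)
{
  yypstate *ps = yypstate_new ();
  int status;
  do
    {
      YYSTYPE lval;
      int token = yylex (&lval);         /* obtain the next token */
      status = yypush_parse (ps, token, &lval);
    }
  while (status == YYPUSH_MORE);
  yypstate_delete (ps);
}
```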


>>> But there should still be a way for people being able to convert that
>>> conveniently to its original string representation from source.y:
>>> 
>>> const char* yysymbolname(enum yysymbolid);
>> 
>> Yes, of course.  That's not "both", that's just what I refer
>> to by "exposing the numbers".  "yysymbolname(x)" is currently
>> just "yytname[x]".
> 
> Sure, it is just not clear to me what your actual future plans for
> yytname[x] are; I see that you are constantly struggling with numerous
> issues because people use what were supposed to be skeleton-internal
> data structures, due to the lack of official public APIs.  That was my
> reason for suggesting official APIs for common specific use cases.  E.g.:
> 
> /**
> * Name of the symbol (i.e. token name or LHS grammar rule name)
> * from input.y
> */
> const char* yysymbolname(enum yysymbolid);

I do not plan to expose an enum for symbol numbers.  What value would
it bring to give names to these numbers?

> /** Human readable error description for symbol. */
> const char* yysymbolerrdesc(enum yysymbolid);

I do not plan to expose several names for a given symbol.  Symbols
have one description, period.  That is in yytname.


> On Monday, January 6, 2020 19:23:27 CET Rici Lake wrote:
>> So I think that there is still time to consider the wider question of how a
>> bison grammar might be able to show both literal keywords and
>> human-readable token descriptions in a way that is useful for both
>> applications. As a side-benefit, this might also make grammars easier to
>> read by humans, because the current mechanism does not make it clear to a
>> human reader in all cases whether a quoted string is intended to be
>> inserted literally into the parsed text, or whether it is necessary to hunt
>> through the grammar's token declarations to find the named token for which
>> the quoted string is an alias. (I've been fooled by this several times
>> while trying to read grammars, particularly when only an extract of the
>> grammar is presented.)
> 
> ++vote;

I already answered that.  That's the plan for Bison 4, whose preparation
has already started.  But it's not for right now, and it's unlikely to
happen in 2020.

> I have already read several people saying that it is not possible to
> address both use cases.  What am I missing here?
> 
> - It would require auto-generating a 2nd table.
> 
> - On the grammar input side, it would make sense to handle this with
>   more specific declarations that reflect the intended semantics more
>   appropriately, e.g.
> 
>       %token LE raw="<=" human-err="operator '<='"
> 
> if localization is desired (i.e. translations):
> 
>       %token LE raw="<=" human-err=_("operator '<='")

Sorry, *I will not support this*.  I have made up my mind.  These
strings are there for error messages only.  This feature does not scale
to the real need, so I will not improve an ill-designed feature that
does not cover all the needs of generating a scanner.

Bison 3.6 will improve the generation of error messages for those who
want to switch to the new system.

Bison 4 will address the *much* wider issue of the scanner interface.


> where e.g. gettext / Qt / Xcode, etc. would handle the actual human-readable 
> translation.
> 
> I know, the long-term solution would be scanner features on Bison's
> side.  But in the meantime these declarations would be useful (also
> more readable), and it wouldn't hurt to leave them in, even in a
> future Bison-included-scanner era.

Yes, it will: things to test, document, and maintain; mails to answer;
overlapping features that confuse newcomers; a tricky implementation to
support competing approaches to the same feature.  Not to mention the
waste of time, which would only delay Bison 4.

It will not happen.



> On Sunday, January 5, 2020 17:30:18 CET Akim Demaille wrote:
>>> One could tackle this particular use case also from a different angle:
>>> We could introduce the concept of "opaque" rules, i.e. rules which are not
>>> expanded when reporting syntax errors.
>>> 
>>> E.g., if I could define "unreserved_keyword" as
>>> 
>>>> unreserved_keyword [opaque]: ABORT_P | ABSOLUTE_P | <...>
>>> 
>>> bison should then create the error message
>>> 
>>>> expected: Identifier, unreserved_keyword
>>> 
>>> instead of
>>> 
>>>> expected: Identifier, <long list containing all unreserved keywords>
>> 
>> Too complex, and fitting just one special case.  With EDSLs, the
>> sheer concept of "keyword" is more complex than this.
> 
> Actually I was thinking about the exact same feature, before Adrian
> suggested it as the "opaque" attribute here, for a different use case
> though: parsers without an external scanner:
> 
> CREATE  :  'c''r''e''a''t''e'
>        ;
> 
> In that case you would like e.g. a syntax error message like
> 
>       "Expecting 'create', got 'foo'"
> 
> instead of
> 
>       "Expecting 'c', got 'f'"
> 
> I inject code to handle that ATM. I could imagine this to be controlled by 
> doxygen style comments, something like:
> 
> 
> /**
> * @symbol-visibility opaque
> */
> CREATE  :  'c''r''e''a''t''e'
>        ;
> 
> That would prevent backward compatibility issues and would handle this
> "detail" feature in a graceful, non-invasive way.
> 
> I could imagine that as an alternative for my %token change suggestions above 
> BTW, that is e.g.:
> 
> /**
> * @raw "<="
> * @human-err "operator '<='"
> */
> %token LE

You clearly stepped out of the traditional dichotomy between the scanner
and the parser (for reasons I perfectly understand and respect).  However,
what you did is just a hack that will be obsoleted by Bison 4, so I will
not work on features that will become useless.

Really, the case of improving string literal support for the scanners is
closed: I won't work on that (with the exception of generating a CSV file
if there is demand for it).

The point here is improving access to the list of expected tokens in
error message generation, and possibly in actions, as exemplified in
the example in my PR:

> #define PRINT_EXPECTED_TOKENS()                                         \
>   do {                                                                  \
>     yysyntax_error_context_t yyctx                                      \
>       = {yyssp, yytoken, yyesa, &yyes, &yyes_capacity};                 \
>     if (yychar != YYEMPTY)                                              \
>       YY_LAC_ESTABLISH;                                                 \
>     int yytokens[YYNTOKENS];                                            \
>     int yycnt = yyexpected_tokens (&yyctx, yytokens, YYNTOKENS);        \
>     fprintf (stderr, "expected tokens (%d):", yycnt);                   \
>     for (int i = 0; i < yycnt; ++i)                                     \
>       fprintf (stderr, " %s", yytname[yytokens[i]]);                    \
>     fprintf (stderr, "\n");                                             \
>   } while (0)

(Of course, some details such as yytname and yysyntax_error_context_t
will be abstracted behind a proper API in the final version.)

and rules such as

> fact:
>   "number"                                  { PRINT_EXPECTED_TOKENS(); }
> | '(' expr { PRINT_EXPECTED_TOKENS(); } ')' { $$ = $expr; }
> ;


> $ echo "(1)" | ./_build/g9d/examples/c/calc/calc
> expected tokens (4): $end "number" '\n' '('
> expected tokens (5): '+' '-' '*' '/' ')'
> expected tokens (3): '+' '-' ')'
> 1
> expected tokens (4): $end "number" '\n' '('


Cheers!

