[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RFC: custom error messages

From: Akim Demaille
Subject: RFC: custom error messages
Date: Fri, 3 Jan 2020 11:07:05 +0100

Hi all,

There is quite some demand to provide the users with more freedom
to customize the error messages.  There's also demand for us to stop
double-quoting the token names in the table of the symbol names.  For
instance in Bison's own parser we have

| static const char *const yytname[] =
| {
|   "\"end of file\"", "error", "$undefined", "\"string\"", "\"%token\"",
|   "\"%nterm\"", "\"%type\"", "\"%destructor\"", "\"%printer\"",
|   "\"%left\"", "\"%right\"", "\"%nonassoc\"", "\"%precedence\"",
|   "\"%prec\"", "\"%dprec\"", "\"%merge\"", "\"%code\"",
|   "\"%default-prec\"", "\"%define\"", "\"%defines\"", "\"%error-verbose\"",
|   "\"%expect\"", "\"%expect-rr\"", "\"%<flag>\"", "\"%file-prefix\"",
|   "\"%glr-parser\"", "\"%initial-action\"", "\"%language\"",
|   "\"%name-prefix\"", "\"%no-default-prec\"", "\"%no-lines\"",
|   "\"%nondeterministic-parser\"", "\"%output\"", "\"%pure-parser\"",
|   "\"%require\"", "\"%skeleton\"", "\"%start\"", "\"%token-table\"",
|   "\"%verbose\"", "\"%yacc\"", "\"{...}\"", "\"%?{...}\"",
|   "\"[identifier]\"", "\"character literal\"", "\":\"", "\"epilogue\"",
|   "\"=\"", "\"identifier\"", "\"identifier:\"", "\"%%\"", "\"|\"",
|   "\"%{...%}\"", "\";\"", "\"<tag>\"", "\"<*>\"", "\"<>\"", "\"integer\"",
|   "\"%param\"", "\"%union\"", "\"%empty\"", "$accept", "input",
|   "prologue_declarations", "prologue_declaration", "$@1", "params",
|   "grammar_declaration", "code_props_type", "union_name",
|   "symbol_declaration", "$@2", "$@3", "precedence_declarator", "tag.opt",
|   "generic_symlist", "generic_symlist_item", "tag", "nterm_decls",
|   "token_decls", "token_decl.1", "token_decl", "int.opt",
|   "token_decls_for_prec", "token_decl_for_prec.1", "token_decl_for_prec",
|   "symbol_decls", "symbol_decl.1", "grammar",
|   "rules_or_grammar_declaration", "rules", "$@4", "rhses.1", "rhs",
|   "named_ref.opt", "variable", "value", "id", "id_colon", "symbol",
|   "string_as_id", "string_as_id.opt", "epilogue.opt", YY_NULLPTR
| };

This double-escaping wrecks non-ASCII string literals, such as
mathematical symbols written in UTF-8.  It also is a problem when you
want to translate your symbols.

About one year ago I made a first attempt to address this:  That
series of patches completely changed the semantics of yytname (by
removing this double-quotation).  I was quite happy with the result (I
liked a lot the fact that there was just one implementation, not a
host of alternatives, as that's the kind of things that make Bison
hard to maintain).  I thought we were done, but there were a few

One severe issue brought to my attention by Rici Lake (unfortunately
privately, although he had written a very nice and detailed mail with
all the details) is that this would break several existing parsers
that expect yytname to be this way.  For instance he pointed to;a=blob;f=src/asn1-parse.y;h=5bff15cd8db64786f7c9e2ef000aeddd583cfdc0;hb=HEAD#l856

currently not responding, but the code is:

| for (k = 0; k < YYNTOKENS; k++)
|   {
|     if (yytname[k] && yytname[k][0] == '\"'
|         && !strncmp (yytname[k] + 1, string, len)
|         && yytname[k][len + 1] == '\"' && !yytname[k][len + 2])
|       return yytoknum[k];
|   }

While I was okay that my changes would modify some error messages of
some parsers, I did not want to break parsers.  Note that the core
problem here is that some people use yytname to write their scanners.
In other words, they use the string literal not as a means to
customize the error messages, but to encode in some way their scanner.

We cannot satisfy both needs (better error messages and helping the
definition of the scanner) at the same time.  I fully support the use
of string literals for better error messages, and will not help to use
them for the scanner.  I will pay attention not to break existing
code, but I will not make any efforts to make it easier, nor to be
portable across skeletons (in the future though, Bison will help these
users, but by a totally different means).

I looked at how people use yytnamerr on GitHub (see at the end of this
message) and I would summarize that:

- some users "fight" the double quotation there

- some users want to translate the token names

- some users want to customize their error messages and try to
  shoehorn that into yytnamerr.  The example of PHP is particularly

I think we can satisfy everybody by offering alternatives to "%define
parse.error verbose".

With "%define parse.error rich" (for lack of a better idea,
suggestions most welcome), you get something equivalent to 'verbose'
but (i) there is no double quotation, (ii) there is support for symbol
translation (as in  This
would _not_ support yytnamerr.

With "%define parse.error custom", the user must define the
yyreport_syntax_error function, which is fully responsible to emit the
syntax error message.  In the branch yyreport_syntax_error (see, the example looks like this:

| int
| yyreport_syntax_error (const yysyntax_error_context_t *ctx)
| {
|   /* Arguments of yyformat: reported tokens (one for the "unexpected",
|      one per "expected"). */
|   char const *arg[YYERROR_VERBOSE_ARGS_MAXIMUM];
|   int n = yysyntax_error_arguments (ctx, arg, sizeof arg / sizeof *arg);
|   if (n == -2)
|     return 2;
|   fprintf (stderr, "SYNTAX ERROR on token [%s]", arg[0]);
|   if (1 < n)
|     {
|       fprintf (stderr, " (expected:");
|       for (int i = 1; i < n; ++i)
|         fprintf (stderr, " [%s]", arg[i]);
|       fprintf (stderr, ")");
|     }
|   fprintf (stderr, "\n");
|   return 0;
| }

Again, this would not use yytnamerr; it's up to the user to implement
and call whatever transformation she wants on the error message.  And
there would be no double quotation in yytname.

Chistian made a strong case in favor of what he called "push" API
rather than pull:

May 11th 2019, Christian Schoenebeck <address@hidden>

> My old concerns still apply, that is I still find a push design
> (that is you call Bison API functions you need explicitly in your
> yyerror() implementation to get just the information you need for
> the error message) more appropriate, flexible, powerful and API
> stable than a pull design (where you would get a predefined list of
> information passed to your yyerror_whatever() implementation; like
> eat it or leave it).
> The push design has various advantages. E.g. Bison can easily be
> extended with new API functions if somebody needs another
> information, without breaking the previous API and people's existing
> yyerror() implementations. No matter what kind of arguments Bison
> would pass to that suggested yyerror_*() function today, I am sure
> later on people would ask about different information to be provided
> by Bison as well, or in a different way. Some of them might even be
> exotic from your point of view, but legitimate for certain use
> cases, some might become replaced & deprecated, etc. Plus developers
> could duplicate and modify the parser state without touching the
> current state to see potential future information, etc.

I think he is right, hence the call to yysyntax_error_arguments which
returns the list of expected/unexpected tokens.

I can't make up my mind on whether returning the list of expected
tokens as strings (as exemplified above), or simply as their symbol
numbers.  Symbol numbers are more efficient, yet they are the
*internal* symbol numbers, not the ones the user is exposed to.

I'm not sure I really want the unexpected token to be in the same
array.  Maybe some day we would like to have $undefined have a string
semantic value, where we would write the unexpected string:

$ echo "1+&" | ./_build/g9d/examples/c/calc/calc
SYNTAX ERROR on string [&] (expected: ["number"] ['('])

would look better than

$ echo "1+&" | ./_build/g9d/examples/c/calc/calc
SYNTAX ERROR on token [$undefined] (expected: ["number"] ['('])

If you are familiar with the current implementation of 'verbose' error
reporting in yacc.c, you'll see that you cannot emulate it with the
custom API I propose.  That's because the 'verbose' implementation
goes into a lot of troubles so that the error message is built and
stored in yyparse's stack frame (using alloca).  That requires a two
step calling convention, where first one would first ask "how much
space do I need for the error message" and a last one is "I allocated
this much space here, please strcpy the error message here".  This is
too complex, I don't think we should support such a complex protocol.

I also plan to get rid of YYERROR_VERBOSE.  Juggling with CPP and M4
at the same time is a nightmare.  YYERROR_VERBOSE was deprecated
looong ago (Bison 2.6, 2012-07-19, claims that's deprecated since

I would really appreciate comments on this proposal.  Sorry for making
it so long.


# hayate891/linuxsampler -- Parser for NKSP real-time instrument script language

They have single quotes in their string literals, yet they don't want
the outer quotes (which are not removed by the default yytnamerr if
there are single quotes).  IOW, would not be needed with my changes.
/// Custom implementation of yytnamerr() to ensure quotation is always stripped 
from token names before printing them to error messages.
int InstrScript_tnamerr(char* yyres, const char* yystr) {
  if (*yystr == '"') {
      int yyn = 0;
      char const *yyp = yystr;
      for (;;)
        switch (*++yyp)
          case '\'':
          case ',':
            goto do_not_strip_quotes;
          case '\\':
            if (*++yyp != '\\')
              goto do_not_strip_quotes;
            /* Fall through.  */
            if (yyres)
              yyres[yyn] = *yyp;

          case '"':
            if (yyres)
              yyres[yyn] = '\0';
            return yyn;
    do_not_strip_quotes: ;

  if (! yyres)
    return (int) yystrlen (yystr);

  return int( yystpcpy (yyres, yystr) - yyres );

# Ortelius maps your microservice configurations with their relationships to 
the application that use them

Display literal tokens with quotes around them (e.g., as [unexpected
"action"], not as [unexpected "\"action\""].

Would not be needed with straight display.

%token <bval> T_BOOL    "boolean value"
%token <ival> NUM       "integer value"
%token <str> STR STR2   "string literal"
%token <str> IDENT      "identifier"
%token <str> DOLIDENT   "$identifier"
%token T_ACTION         "\"action\""
%token T_BREAK          "\"break\""
%token T_CASE           "\"case\""
%token T_CATCH          "\"catch\""

case '\\':
        /* RHT mod - allow double-quotes to be escaped */
        if((*yyp != '\\') && (*yyp != '"')) {
                goto do_not_strip_quotes;


%token <ast> T_LNUMBER   "integer number (T_LNUMBER)"
%token <ast> T_DNUMBER   "floating-point number (T_DNUMBER)"
%token <ast> T_STRING    "identifier (T_STRING)"
%token <ast> T_VARIABLE  "variable (T_VARIABLE)"
%token <ast> T_INLINE_HTML
%token <ast> T_ENCAPSED_AND_WHITESPACE  "quoted-string and whitespace 
%token <ast> T_CONSTANT_ENCAPSED_STRING "quoted-string 
%token <ast> T_STRING_VARNAME "variable name (T_STRING_VARNAME)"
%token <ast> T_NUM_STRING "number (T_NUM_STRING)"

They maintain state between the different calls to yytname to know
whether its the unexpected token, or one of the expected.  To this end
they use `CG(parse_error)`, changed in yytnamerr.
# define CG(v) (compiler_globals.v)

/* Copy to YYRES the contents of YYSTR after stripping away unnecessary
   quotes and backslashes, so that it's suitable for yyerror.  The
   heuristic is that double-quoting is unnecessary unless the string
   contains an apostrophe, a comma, or backslash (other than
   backslash-backslash).  YYSTR is taken from yytname.  If YYRES is
   null, do not copy; instead, return the length of what the result
   would have been.  */
static YYSIZE_T zend_yytnamerr(char *yyres, const char *yystr)
        /* CG(parse_error) states:
         * 0 => yyres = NULL, yystr is the unexpected token
         * 1 => yyres = NULL, yystr is one of the expected tokens
         * 2 => yyres != NULL, yystr is the unexpected token
         * 3 => yyres != NULL, yystr is one of the expected tokens
        if (yyres && CG(parse_error) < 2) {
                CG(parse_error) = 2;

        if (CG(parse_error) % 2 == 0) {
                /* The unexpected token */
                char buffer[120];
                const unsigned char *end, *str, *tok1 = NULL, *tok2 = NULL;
                unsigned int len = 0, toklen = 0, yystr_len;


                if (LANG_SCNG(yy_text)[0] == 0 &&
                        LANG_SCNG(yy_leng) == 1 &&
                        strcmp(yystr, "\"end of file\"") == 0) {
                        if (yyres) {
                                yystpcpy(yyres, "end of file");
                        return sizeof("end of file")-1;

                str = LANG_SCNG(yy_text);
                end = memchr(str, '\n', LANG_SCNG(yy_leng));
                yystr_len = (unsigned int)yystrlen(yystr);

                if ((tok1 = memchr(yystr, '(', yystr_len)) != NULL
                        && (tok2 = zend_memrchr(yystr, ')', yystr_len)) != 
                        toklen = (tok2 - tok1) + 1;
                } else {
                        tok1 = tok2 = NULL;
                        toklen = 0;

                if (end == NULL) {
                        len = LANG_SCNG(yy_leng) > 30 ? 30 : LANG_SCNG(yy_leng);
                } else {
                        len = (end - str) > 30 ? 30 : (end - str);
                if (yyres) {
                        if (toklen) {
                                snprintf(buffer, sizeof(buffer), "'%.*s' %.*s", 
len, str, toklen, tok1);
                        } else {
                                snprintf(buffer, sizeof(buffer), "'%.*s'", len, 
                        yystpcpy(yyres, buffer);
                return len + (toklen ? toklen + 1 : 0) + 2;

        /* One of the expected tokens */
        if (!yyres) {
                return yystrlen(yystr) - (*yystr == '"' ? 2 : 0);

        if (*yystr == '"') {
                YYSIZE_T yyn = 0;
                const char *yyp = yystr;

                for (; *++yyp != '"'; ++yyn) {
                        yyres[yyn] = *yyp;
                yyres[yyn] = '\0';
                return yyn;
        yystpcpy(yyres, yystr);
        return strlen(yystr);

GC(parse_error) is reset when the error message is reported:
ZEND_COLD void zenderror(const char *error) /* {{{ */
        CG(parse_error) = 0;

        if (EG(exception)) {
                /* An exception was thrown in the lexer, don't throw another in 
the parser. */

        zend_throw_exception(zend_ce_parse_error, error, 0);
/* }}} */


# PolishReadabilityChecker

They want i18n.  I don't see where the other tokens are defined though.

static size_t yytnamerr(char *yyres, const char *yystr)
   const char *translation = _(yystr);
   size_t length = strlen(translation);
   if (yyres != NULL)
      strcpy(yyres, translation);
   return length;
#if 0

# Daruma BASIC for Raspberry Pi and Mac OS X

They want I18n.

const char *err_msg[] = {
        /* BS_M_ONLY_ONE_DO_COND */
        "only one WHILE/UNTIL condition may appear with DO or LOOP keyword",
        /* BS_M_UNDEF_SYMBOL */
        "undefined symbol %s",

const char *translated_err_msg[] = {
        /* BS_M_ONLY_ONE_DO_COND */
        /* BS_M_UNDEF_SYMBOL */
unsigned int
my_yytnamerr(char *yyres, const char *yystr)
        /*  This is a patch to remove 'BS_' from the token name  */
        if (strncmp(yystr, "BS_", 3) == 0)
                yystr += 3;
        /*  End patch  */
/*  Reserved words  */

# Intel SPMD Program Compiler

yytnamerr starts with an addition to be able to rename some of the
tokens.  Otherwise, back to the default implementation.

This is C++, yet they use yacc.c.

static int
lYYTNameErr (char *yyres, const char *yystr)
  extern std::map<std::string, std::string> tokenNameRemap;
  Assert(tokenNameRemap.size() > 0);
  if (tokenNameRemap.find(yystr) != tokenNameRemap.end()) {
      std::string n = tokenNameRemap[yystr];
      if (yyres == NULL)
          return n.size();
          return yystpcpy(yyres, n.c_str()) - yyres;

  if (*yystr == '"')
      YYSIZE_T yyn = 0;
      char const *yyp = yystr;

      for (;;)
        switch (*++yyp)
          case '\'':
          case ',':
            goto do_not_strip_quotes;

          case '\\':
            if (*++yyp != '\\')
              goto do_not_strip_quotes;
            /* Fall through.  */
            if (yyres)
              yyres[yyn] = *yyp;

          case '"':
            if (yyres)
              yyres[yyn] = '\0';
            return yyn;
    do_not_strip_quotes: ;

  if (! yyres)
    return yystrlen (yystr);

  return yystpcpy (yyres, yystr) - yyres;
void ParserInit() {
    tokenToName[TOKEN_ASSERT] = "assert";
    tokenToName[TOKEN_BOOL] = "bool";
    tokenToName[TOKEN_BREAK] = "break";
    tokenToName[TOKEN_CASE] = "case";
    tokenToName[TOKEN_CDO] = "cdo";
    tokenToName[TOKEN_CFOR] = "cfor";
    tokenToName[TOKEN_CIF] = "cif";
    tokenNameRemap["TOKEN_ASSERT"] = "\'assert\'";
    tokenNameRemap["TOKEN_BOOL"] = "\'bool\'";
    tokenNameRemap["TOKEN_BREAK"] = "\'break\'";
    tokenNameRemap["TOKEN_CASE"] = "\'case\'";
    tokenNameRemap["TOKEN_CDO"] = "\'cdo\'";
    tokenNameRemap["TOKEN_CFOR"] = "\'cfor\'";

# Ruby

 * strip enclosing double-quotes, same as the default yytnamerr except
 * for that single-quotes matching back-quotes do not stop stripping.
 *  "\"`class' keyword\"" => "`class' keyword"
rb_yytnamerr(struct parser_params *p, char *yyres, const char *yystr)
    if (*yystr == '"') {
        size_t yyn = 0, bquote = 0;
        const char *yyp = yystr;


reply via email to

[Prev in Thread] Current Thread [Next in Thread]