bug-bison

Re: Error UTF-8 strings


From: Hans Åberg
Subject: Re: Error UTF-8 strings
Date: Mon, 22 Jun 2020 10:40:55 +0200

> On 22 Jun 2020, at 07:59, Akim Demaille <akim@lrde.epita.fr> wrote:
> 
>> Le 21 juin 2020 à 15:24, Hans Åberg <haberg-1@telia.com> a écrit :
>> 
>> 
>>> On 21 Jun 2020, at 14:25, Hans Åberg <haberg-1@telia.com> wrote:
>>> 
>>>> On 21 Jun 2020, at 11:45, Akim Demaille <akim@lrde.epita.fr> wrote:
>>>> 
>>>> What locale are you using?
>>> 
>>> LC_CTYPE=UTF-8
>> 
>> The error goes away when LC_CTYPE=en_US.UTF-8 is set before recompiling the
>> .yy file.
>> 
>> UTF-8 is language independent, so macOS uses LC_CTYPE=UTF-8, but there is
>> software that requires a language prefix.
> 
> Hans,
> 
> This double-escaping of the UTF-8 characters is a well-known problem
> of parse.error=verbose, that resulted in the introduction of "detailed"
> parse.error.  That was discussed extensively on Bison's lists, and is
> documented in NEWS of 3.6:
> 
> 
> 
> *** Improved syntax error messages
> 
>  Two new values for the %define parse.error variable offer more control to
>  the user.  Available in all the skeletons (C, C++, Java).
> 
> **** %define parse.error detailed
> 
>  The behavior of "%define parse.error detailed" is closely resembling that
>  of "%define parse.error verbose" with a few exceptions.  First, it is safe
>  to use non-ASCII characters in token aliases (with 'verbose', the result
>  depends on the locale with which bison was run).  Second, a yysymbol_name
>  function is exposed to the user, instead of the yytnamerr function and the
>  yytname table.  Third, token internationalization is supported (see
>  below).

The question is whether that helps, since it is yytname_ that gets translated
according to the LC_CTYPE environment variable.

This also introduces a locale dependency when Bison compiles the grammar, so
the generated parser is no longer platform independent.
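
To make the dependency concrete, here is a rough sketch in Python. It is a
simplified model, not Bison's actual code: the name escape_alias and the flag
multibyte_printable are made up for illustration, and I am assuming that any
byte the locale does not consider printable is written out as an octal escape.

```python
def escape_alias(alias: str, multibyte_printable: bool) -> str:
    """Hypothetical model of locale-dependent escaping of a token alias.

    multibyte_printable stands for "the locale in effect when the grammar
    was compiled treats UTF-8 multibyte sequences as printable".
    """
    if multibyte_printable:
        # The alias is passed through unchanged.
        return alias
    out = []
    for byte in alias.encode("utf-8"):
        if 0x20 <= byte < 0x7F:
            # Printable ASCII is kept as is.
            out.append(chr(byte))
        else:
            # Everything else becomes a three-digit octal escape.
            out.append("\\%03o" % byte)
    return "".join(out)


if __name__ == "__main__":
    # Same alias, two different generated strings.
    print(escape_alias("Å", multibyte_printable=True))
    print(escape_alias("Å", multibyte_printable=False))  # \303\205
```

The same .yy file thus yields different generated sources depending on the
environment in which it was compiled, which is the point I am making.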

> Besides, I have recently posted that Bison 3.7 will also make another step:
> 
> 
> 
> *** String aliases are faithfully propagated
> 
>  Bison used to interpret user strings (i.e., decoding backslash escapes)
>  when reading them, and to escape them (i.e., issue non-printable
>  characters as backslash escapes, taking the locale into account) when
>  outputting them.  As a consequence non-ASCII strings (say in UTF-8) ended
>  up "ciphered" as sequences of backslash escapes.  This happened not only
>  in the generated sources (where the compiler will reinterpret them), but
>  also in all the generated reports (text, xml, html, dot, etc.).  Reports
>  were therefore not readable when string aliases were not pure ASCII.
>  Worse yet: the output depended on the user's locale.
> 
>  Now Bison faithfully treats the string aliases exactly the way the user
>  spelled them.  This fixes all the aforementioned problems.  However, now,
>  string aliases semantically equivalent but syntactically different (e.g.,
>  "A", "\x41", "\101") are considered to be different.

This might help as well.
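
As a small illustration of the caveat at the end of that NEWS entry, here is
a Python sketch (not Bison code; the helper name interpret is hypothetical).
It takes the three spellings from the example: once the backslash escapes are
decoded they are one and the same string, but compared exactly as spelled
they are three distinct strings.

```python
# The three spellings from the NEWS entry, exactly as a user would type them.
SPELLINGS = [r"A", r"\x41", r"\101"]


def interpret(spelling: str) -> str:
    """Decode backslash escapes, roughly the way a C compiler would."""
    return spelling.encode("ascii").decode("unicode_escape")


def distinct_interpreted(spellings):
    """Aliases that remain distinct after decoding the escapes."""
    return {interpret(s) for s in spellings}


def distinct_as_spelled(spellings):
    """Aliases that remain distinct when compared exactly as written."""
    return set(spellings)


if __name__ == "__main__":
    print(len(distinct_interpreted(SPELLINGS)))  # 1
    print(len(distinct_as_spelled(SPELLINGS)))   # 3
```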

> So, there is no new bug in 3.6 here, just something that has been well known
> for ages, and that you and I have already discussed.

Yes, there is: the translation now depends on LC_CTYPE, which it did not before.



