Re: Error UTF-8 strings
From: Hans Åberg
Subject: Re: Error UTF-8 strings
Date: Mon, 22 Jun 2020 10:40:55 +0200
> On 22 Jun 2020, at 07:59, Akim Demaille <akim@lrde.epita.fr> wrote:
>
>> Le 21 juin 2020 à 15:24, Hans Åberg <haberg-1@telia.com> a écrit :
>>
>>
>>> On 21 Jun 2020, at 14:25, Hans Åberg <haberg-1@telia.com> wrote:
>>>
>>>> On 21 Jun 2020, at 11:45, Akim Demaille <akim@lrde.epita.fr> wrote:
>>>>
>>>> What locale are you using?
>>>
>>> LC_CTYPE=UTF-8
>>
>> The error goes away if setting LC_CTYPE=en_US.UTF-8 before recompiling the
>> .yy file.
>>
>> UTF-8 is language independent, so MacOS uses LC_CTYPE=UTF-8, but there is
>> software that requires a prefix.
>
> Hans,
>
> This double-escaping of the UTF-8 characters is a well-known problem
> of parse.error=verbose, that resulted in the introduction of "detailed"
> parse.error. That was discussed extensively on Bison's lists, and is
> documented in NEWS of 3.6:
>
>
>
> *** Improved syntax error messages
>
> Two new values for the %define parse.error variable offer more control to
> the user. Available in all the skeletons (C, C++, Java).
>
> **** %define parse.error detailed
>
> The behavior of "%define parse.error detailed" is closely resembling that
> of "%define parse.error verbose" with a few exceptions. First, it is safe
> to use non-ASCII characters in token aliases (with 'verbose', the result
> depends on the locale with which bison was run). Second, a yysymbol_name
> function is exposed to the user, instead of the yytnamerr function and the
> yytname table. Third, token internationalization is supported (see
> below).
The question is whether that helps, since it is yytname_ that is translated
according to the LC_CTYPE environment variable.
This also introduces a locale dependency into the Bison compilation, so the
generated parser is no longer platform independent.
> Besides, I have recently posted that Bison 3.7 will also make another step:
>
>
>
> *** String aliases are faithfully propagated
>
> Bison used to interpret user strings (i.e., decoding backslash escapes)
> when reading them, and to escape them (i.e., issue non-printable
> characters as backslash escapes, taking the locale into account) when
> outputting them. As a consequence non-ASCII strings (say in UTF-8) ended
> up "ciphered" as sequences of backslash escapes. This happened not only
> in the generated sources (where the compiler will reinterpret them), but
> also in all the generated reports (text, xml, html, dot, etc.). Reports
> were therefore not readable when string aliases were not pure ASCII.
> Worse yet: the output depended on the user's locale.
>
> Now Bison faithfully treats the string aliases exactly the way the user
> spelled them. This fixes all the aforementioned problems. However, now,
> string aliases semantically equivalent but syntactically different (e.g.,
> "A", "\x41", "\101") are considered to be different.
This, besides, might help.
> So, there is no new bug in 3.6 here, just something that has been well known
> for ages, which you and I already discussed.
Yes, there is: the translation now depends on LC_CTYPE, which it did not before.