Re: RFC: enum instead of #define for tokens

From: Hans Aberg
Subject: Re: RFC: enum instead of #define for tokens
Date: Sat, 6 Apr 2002 00:01:11 +0200

At 11:54 -0800 2002/04/05, Paul Eggert wrote:
>> Is this cross compiler problem common?
>It depends on what you mean by "common".  If you use EBCDIC it's
>common.  If you use the non-ASCII part of ISO 8859-1 and are
>collaborating with someone else who's using some other character set
>in the ISO 8859 series, it's common.  Assuming Bison supports
>multibyte character sets properly (isn't that how we got started on
>this thread?), a similar problem occurs with the non-ASCII parts of
>UTF-8, EUC-JP, shift-JIS, etc.

The point is that the problem arises if you run Bison on one platform and
then transport the generated sources to another. But the problem may show up
somewhere anyway. -- One ends up with questions that ultimately have to do
with the failings of C/C++, not Bison.

>> -- Note that the problem does not exist for Unicode UTF-n encodings
>Only if everyone agrees to use that particular extension to ASCII.

I'm not sure what you mean here: if Bison has a Unicode feature to be
turned on, then it will work only for Unicode UTF-n, n >= 21, streams, but
those will agree on any platform; the compacted yytranslate[] table will be
the same on any platform. Further, Linux is evidently already using
UTF-32, so as far as GNU is concerned, it should be a non-issue.

It is probably only backwards MSOS that uses UTF-16; but that ain't GNU. If
one uses UTF-16 and no symbols requiring more than one 16-bit code unit,
then the yytranslate[] table will be the same as for UTF-n, n >= 21.

>> Note that one may want to use the yytranslate[] table as is if one is using
>> distributed programming, say a WWW-browser reading ASCII on an EBCDIC
>> computer.
>Yes, that's the sort of scenario I was worried about.

But here it is a desirable feature: compile the sources with Bison on the
ASCII platform only, and they will compile correctly on the EBCDIC computer.
The alternative would be to write sources like
  char ASCII_a = 0x61;
and then handwrite the lexer using that. This is what a guy writing a WWW
server told me he was doing. -- Extremely painful.

One ends up with the question of defining which encodings the parser and
lexer should be able to handle.

Under C++, this can be done by hooking a code converter onto the IO
streams. Thus, if one decides to settle for Unicode UTF-n, n >= 21,
internally in Flex/Bison, then the generated combined lexer/parser can be
made to parse any encoding by invoking the platform-specific code
converter: just compile the sources on, say, a Linux machine, which does it
correctly, and the compacted yytranslate[] table will be correct for
Unicode. On another platform, one then invokes the local code converter
from the favorite format to Unicode UTF-n.

  Hans Aberg
