UTF-8/Unicode Bison

From: Hans Aberg
Subject: UTF-8/Unicode Bison
Date: Sun, 09 Jan 2005 14:40:41 +0100
User-agent: Microsoft-Outlook-Express-Macintosh-Edition/5.0.6

There seems to be a simple way to extend Bison to Unicode. Essentially, it
amounts to giving meaning to the '...' construct for Unicode characters. One
way is to treat the quoted text as a UTF-8 multibyte sequence, which Bison
would then handle as a sequence of character tokens. If the .y grammar file
is assumed to be in UTF-8, all that is needed is to give 'c1 ... ck' meaning
for a suitable sequence of bytes c1, ..., ck, by merely translating it into
the character token sequence 'c1'...'ck'.
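
To make the translation concrete, here is a minimal sketch in C. It is not
Bison code: the helper names utf8_literal_to_tokens and format_token_sequence
are hypothetical, and the sketch assumes the text between the quotes has
already been extracted from a UTF-8 grammar file. Each byte becomes one
character token, written with a hexadecimal escape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: turn the bytes of a UTF-8 literal (the text found
   between the quotes) into one character token per byte.
   Returns the number of tokens stored, at most max_tokens. */
size_t utf8_literal_to_tokens(const char *literal, int *tokens,
                              size_t max_tokens)
{
    size_t n = 0;
    assert(literal != NULL && tokens != NULL);
    while (literal[n] != '\0' && n < max_tokens) {
        /* Cast through unsigned char so bytes >= 0x80 stay positive. */
        tokens[n] = (unsigned char)literal[n];
        n++;
    }
    return n;
}

/* Hypothetical helper: write the tokens as the grammar text
   'c1' ... 'ck', one hexadecimal escape per byte.
   Returns the number of characters written; on lack of space the
   output keeps only the tokens that fit completely. */
size_t format_token_sequence(const int *tokens, size_t count,
                             char *out, size_t out_size)
{
    size_t used = 0;
    assert(tokens != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        int written = snprintf(out + used, out_size - used, "%s'\\x%02X'",
                               i ? " " : "", (unsigned)tokens[i]);
        if (written < 0 || (size_t)written >= out_size - used) {
            out[used] = '\0';   /* drop the partially written token */
            break;
        }
        used += (size_t)written;
    }
    return used;
}
```

For example, the two-byte UTF-8 encoding of U+00E9 (bytes 0xC3 0xA9) would
come out as the token sequence '\xC3' '\xA9'.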

As for the yylex handshaking, I see two possibilities. One is a UTF-8 mode,
where the bytes of a multibyte sequence are returned one at a time, in a
succession of yylex calls. The other is a Unicode mode, where yylex returns
the full Unicode number in UTF-32. Bison would then start its token numbers
at a number higher than 0x10FFFF, the highest possible Unicode number. If a
Unicode number is returned by yylex, then the Bison parser translates it
into a UTF-8 sequence, which is then processed as normal.
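
The Unicode mode could be sketched as follows, again in C and again only as
an illustration. The names FIRST_NAMED_TOKEN, utf8_encode and
expand_lexer_value are hypothetical and not part of Bison; the sketch assumes
that every value up to 0x10FFFF returned by yylex is a Unicode number and
that anything above it is an ordinary token number.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CODE_POINT    0x10FFFFL
/* Assumption: named tokens are numbered above the Unicode range. */
#define FIRST_NAMED_TOKEN (MAX_CODE_POINT + 1)

/* Encode one Unicode number as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above 0x10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= (unsigned long)MAX_CODE_POINT) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Expand one value returned by yylex into the tokens the parser would
   process: a named token passes through unchanged, a Unicode number
   becomes its UTF-8 bytes. Returns the token count, 0 if invalid. */
size_t expand_lexer_value(long value, int tokens[4])
{
    unsigned char bytes[4];
    size_t n, i;
    assert(tokens != NULL);
    if (value >= FIRST_NAMED_TOKEN) {
        tokens[0] = (int)value;
        return 1;
    }
    if (value < 0)
        return 0;
    n = utf8_encode((unsigned long)value, bytes);
    for (i = 0; i < n; i++)
        tokens[i] = bytes[i];
    return n;
}
```

With this, a yylex that returns 0x20AC (the euro sign) would be seen by the
parser as the three character tokens 0xE2, 0x82, 0xAC, exactly as in the
UTF-8 mode.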

  Hans Aberg
