Re: UTF-8 Flex Character Classes

From: Hans Aberg
Subject: Re: UTF-8 Flex Character Classes
Date: Mon, 10 Jan 2005 18:52:23 +0100
User-agent: Microsoft-Outlook-Express-Macintosh-Edition/5.0.6

At 08:12 -0500 2005/01/10, address@hidden wrote:
>How will this look from the perspective of the flex-based application
>writer? That is, can a programmer use the natural syntax to describe UTF-8
>data and, if flex were to use the below scheme, expect that flex would do
>the right thing?

Over the weekend I hacked together some UTF-8/UTF-32 functions in
Haskell. These include a translation of UTF-32 character intervals into
regular-expression UTF-8 character classes. It currently works only for
1- and 2-byte UTF-8 sequences, but I am extending it to the full range now.
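
A minimal sketch of such an interval translator, written in Python rather than Haskell and restricted, like the version described above, to 1- and 2-byte sequences (U+0000..U+07FF). The names byte_class, utf8_interval_regex and accepts are hypothetical, the output uses flex-style \xHH escapes, and Python's re module only stands in for flex when checking the result.

```python
import re

def byte_class(lo, hi):
    """Byte range written as a flex-style pattern with \\xHH escapes."""
    if lo == hi:
        return "\\x%02X" % lo
    return "[\\x%02X-\\x%02X]" % (lo, hi)

def utf8_interval_regex(lo, hi):
    """Byte-level pattern for code points lo..hi (U+0000..U+07FF only)."""
    if not 0 <= lo <= hi <= 0x7FF:
        raise ValueError("only 1- and 2-byte sequences are handled here")
    parts = []
    if lo <= 0x7F:                       # 1-byte part: the byte is the code point
        parts.append(byte_class(lo, min(hi, 0x7F)))
    if hi >= 0x80:                       # 2-byte part: 110xxxxx 10xxxxxx
        lo2 = max(lo, 0x80)
        lead_lo, cont_lo = 0xC0 | (lo2 >> 6), 0x80 | (lo2 & 0x3F)
        lead_hi, cont_hi = 0xC0 | (hi >> 6), 0x80 | (hi & 0x3F)
        any_cont = byte_class(0x80, 0xBF)
        if lead_lo == lead_hi:
            parts.append(byte_class(lead_lo, lead_lo) + byte_class(cont_lo, cont_hi))
        else:
            tail = None
            if cont_lo != 0x80:          # first lead byte is only partly covered
                parts.append(byte_class(lead_lo, lead_lo) + byte_class(cont_lo, 0xBF))
                lead_lo += 1
            if cont_hi != 0xBF:          # last lead byte is only partly covered
                tail = byte_class(lead_hi, lead_hi) + byte_class(0x80, cont_hi)
                lead_hi -= 1
            if lead_lo <= lead_hi:       # fully covered lead bytes in between
                parts.append(byte_class(lead_lo, lead_hi) + any_cont)
            if tail is not None:
                parts.append(tail)
    return "|".join(parts)

def accepts(pattern, cp):
    """Check a generated pattern against the UTF-8 encoding of code point cp."""
    return re.fullmatch(pattern.encode("ascii"), chr(cp).encode("utf-8")) is not None
```

For example, utf8_interval_regex(0xE0, 0xFF) gives \xC3[\xA0-\xBF], and the full range 0x00..0x7FF gives [\x00-\x7F]|[\xC2-\xDF][\x80-\xBF].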

Exactly what happens can probably only be determined by writing an
implementation, as it is easy to overlook things when thinking about it
theoretically. But it looks as though one might extend the Flex character
classes. If the .l file is assumed to be in UTF-8, one merely admits UTF-8
in the character classes as well. When these character classes are
compiled, the UTF-8 characters they contain are translated into UTF-32. This
produces a set of UTF-32 intervals, which can be handled one at a time by
joining them with the regular expression "|". Each interval then needs a
translator into a byte-level regular expression, like the one I have
partially written. One may then need some further additions that allow
character classes to be described better.
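
The first half of that compilation step might be sketched as follows: decode the UTF-8 body of a character class into code points, collect single characters and "a-z" style ranges as intervals, and merge the intervals that overlap or touch. This is an illustration in Python under simplifying assumptions: class_intervals is a hypothetical name, and escapes, negation ("^") and named classes are ignored.

```python
def class_intervals(body):
    """Turn the UTF-8 bytes of a class body (the text between [ and ])
    into a sorted list of merged (lo, hi) code point intervals."""
    cps = [ord(c) for c in body.decode("utf-8")]   # UTF-8 -> code points
    intervals = []
    i = 0
    while i < len(cps):
        # "x-y" is a range; a "-" with nothing after it is a plain character
        if i + 2 < len(cps) and cps[i + 1] == ord("-"):
            lo, hi = cps[i], cps[i + 2]
            if lo > hi:
                raise ValueError("reversed range in character class")
            intervals.append((lo, hi))
            i += 3
        else:
            intervals.append((cps[i], cps[i]))
            i += 1
    intervals.sort()
    merged = []
    for lo, hi in intervals:
        if merged and lo <= merged[-1][1] + 1:     # overlapping or adjacent
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```

Each resulting interval would then be handed to a per-interval translator, and the translated pieces joined with "|".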

As for the "." character class, it might suffice to add one construct for
each of the 1-6 byte UTF-8 sequence lengths, plus one that catches them all.
(For use with Bison, the lexer might return a UTF-8 character sequence
either as a UTF-32 int or as a sequence of bytes. Cf. my suggestion on the
Bug-Bison list.) In this, and in the above, I have not given special
attention to illegal UTF-8 character sequences.
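
One way to picture those constructs is as purely structural byte patterns, one per sequence length, plus their union. The sketch below is in Python with hypothetical names (LEAD, CONT, seq_pattern, ANY_CHAR, matches). It assumes the 1-6 byte layout of the original UTF-8 definition (RFC 3629 later restricted UTF-8 to at most 4 bytes), it does not reject overlong or otherwise illegal sequences, and unlike the flex "." it does not exclude newline.

```python
import re

CONT = r"[\x80-\xBF]"                    # continuation byte 10xxxxxx
LEAD = {                                 # lead byte range per sequence length
    1: r"[\x00-\x7F]",
    2: r"[\xC0-\xDF]",
    3: r"[\xE0-\xEF]",
    4: r"[\xF0-\xF7]",
    5: r"[\xF8-\xFB]",
    6: r"[\xFC-\xFD]",
}

def seq_pattern(n):
    """Pattern for any n-byte sequence: one lead byte, n-1 continuation bytes."""
    return LEAD[n] + CONT * (n - 1)

# The "catch them all" construct: any sequence of 1 to 6 bytes.
ANY_CHAR = "|".join(seq_pattern(n) for n in range(1, 7))

def matches(pattern, data):
    """Check a pattern against raw bytes, with Python's re standing in for flex."""
    return re.fullmatch(pattern.encode("ascii"), data) is not None
```

Here matches(seq_pattern(2), b"\xC3\xA9") holds, while a lone continuation byte such as b"\x80" is not matched by ANY_CHAR.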

  Hans Aberg
