Re: UTF-8 Flex Character Classes

From: wlestes
Subject: Re: UTF-8 Flex Character Classes
Date: Mon, 10 Jan 2005 08:12:06 -0500 (EST)
User-agent: SquirrelMail/1.4.3a

How will this look from the perspective of the flex-based application
writer? That is, can a programmer use the natural syntax to describe UTF-8
data and, if flex were to use the scheme below, expect flex to do the
right thing?

> It seems that one can translate UTF-8 character classes into regular
> expressions over bytes as follows:
> UTF-8 represents a character as a sequence of 1 to 6 bytes:
> U-00000000 - U-0000007F: 0xxxxxxx
> U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
> U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
> U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>                          10xxxxxx
> The code positions U+D800 to U+DFFF (UTF-16 surrogates) as well as U+FFFE
> and U+FFFF must not occur in UTF-8. Also, one must use the shortest
> possible byte sequence; longer encodings of the same character are illegal.
>
> Now, given a UTF-8 character range (or a UTF-32 character range translated
> into a UTF-8 character range), it decomposes into a choice of ranges, one
> per n-byte group, which can be handled using the r|s regular expression
> construct. Therefore it suffices to treat a character range within each
> k-byte group. Take, for example, the 2-byte group:
>   U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
> Suppose that, in the given character range, the smallest 110xxxxx does not
> fill up all 10xxxxxx patterns. Then the regular expression for this part
> becomes this smallest 110xxxxx followed by a character range in the
> 10xxxxxx pattern. Then do the same for the largest 110xxxxx, to give
> another regular expression. The remaining 110xxxxx each fill up all
> 10xxxxxx positions, so this gives the regular expression of a character
> range in 110xxxxx followed (concatenated) by the character class of all
> 10xxxxxx patterns. Then follow a similar method for the higher (3- through
> 6-byte) groups.
>
> So it seems that Flex's character class constructs can also be adapted for
> use with UTF-8. The UTF-8 mode should then perhaps pretend that there are
> no other characters than those in UTF-8.
>   Hans Aberg
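
As a rough illustration of the decomposition quoted above, here is a
minimal Python sketch. It is not flex code, and not necessarily how flex
would implement it; the names utf8_byte_ranges and to_regex are made up for
this example. It assumes the RFC 3629 limit of U+10FFFF (so only the 1- to
4-byte groups are handled), skips the surrogates U+D800 to U+DFFF, and does
not filter out U+FFFE and U+FFFF.

```python
import re

# Largest code point of the 1-, 2- and 3-byte groups (assumption: RFC 3629,
# so the 4-byte group ends at U+10FFFF and 5/6-byte forms are not produced).
GROUP_MAX = (0x7F, 0x7FF, 0xFFFF)
SURR_LO, SURR_HI = 0xD800, 0xDFFF


def utf8_byte_ranges(lo, hi):
    """Split code points lo..hi into alternatives.

    Each alternative is a list of (min_byte, max_byte) pairs, one per byte.
    """
    # Surrogates have no UTF-8 encoding: cut them out of the range.
    if lo <= SURR_HI and hi >= SURR_LO:
        out = []
        if lo < SURR_LO:
            out += utf8_byte_ranges(lo, SURR_LO - 1)
        if hi > SURR_HI:
            out += utf8_byte_ranges(SURR_HI + 1, hi)
        return out
    # One-byte group: a plain byte range.
    if hi <= 0x7F:
        return [[(lo, hi)]]
    # Split at the boundaries between the k-byte groups.
    for top in GROUP_MAX:
        if lo <= top < hi:
            return utf8_byte_ranges(lo, top) + utf8_byte_ranges(top + 1, hi)
    # Same group: peel off the partial blocks at the low and high end until
    # every trailing 10xxxxxx position is either fixed or completely filled.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if lo & ~mask != hi & ~mask:
            if lo & mask != 0:
                cut = lo | mask
                return utf8_byte_ranges(lo, cut) + utf8_byte_ranges(cut + 1, hi)
            if hi & mask != mask:
                cut = hi & ~mask
                return utf8_byte_ranges(lo, cut - 1) + utf8_byte_ranges(cut, hi)
    # Now the range is a "rectangle": pair up the bytes of both end points.
    first, last = chr(lo).encode("utf-8"), chr(hi).encode("utf-8")
    return [list(zip(first, last))]


def to_regex(alternatives):
    """Render the alternatives as a byte-oriented regular expression."""
    def cls(a, b):
        return f"\\x{a:02x}" if a == b else f"[\\x{a:02x}-\\x{b:02x}]"
    return "|".join("".join(cls(a, b) for a, b in alt) for alt in alternatives)


if __name__ == "__main__":
    # U+00E0..U+00FF all share the lead byte 0xC3.
    print(to_regex(utf8_byte_ranges(0xE0, 0xFF)))  # \xc3[\xa0-\xbf]
```

The r|s construct appears as the "|" between alternatives, and each
alternative is a concatenation of ordinary byte classes, so the result can
be fed to any byte-oriented matcher. The sketch makes no attempt to merge
adjacent alternatives into fewer, wider ones.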