UTF-8 Flex Character Classes

From: Hans Aberg
Subject: UTF-8 Flex Character Classes
Date: Sun, 09 Jan 2005 00:16:45 +0100
User-agent: Microsoft-Outlook-Express-Macintosh-Edition/5.0.6

It seems that one can translate UTF-8 character classes into byte-regular
expressions as follows:

UTF-8 represents each character as a sequence of 1 to 6 bytes, as follows:
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The code positions U+D800 to U+DFFF (UTF-16 surrogates) as well as U+FFFE
and U+FFFF must not occur in UTF-8. Also, one must use the shortest possible
byte sequence for each character; longer encodings of the same character are
illegal.
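
As a rough illustration of the table above, here is a minimal Python sketch of
an encoder that always picks the shortest sequence. The names utf8_encode,
GROUP_MAX and LEAD_MARKER are hypothetical, chosen for this example only. It
follows the 1-6 byte scheme as listed here; note that current UTF-8 (RFC 3629)
stops at U+10FFFF, i.e. at 4 bytes.

```python
# Largest code point representable with 1..6 bytes (from the table above).
GROUP_MAX = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
# Marker bits of the lead byte for each length: 0xxxxxxx, 110xxxxx, 1110xxxx, ...
LEAD_MARKER = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def utf8_encode(cp):
    """Encode one code point with the shortest sequence from the table."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    # Exclusions named in the text: surrogates, U+FFFE and U+FFFF.
    if 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF):
        raise ValueError("code point must not occur in UTF-8")
    # Shortest form: the first group whose upper bound covers cp.
    n = next(i for i, top in enumerate(GROUP_MAX) if cp <= top) + 1
    if n == 1:
        return bytes([cp])
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))  # 10xxxxxx, low six bits first
        cp >>= 6
    # The bits left in cp go into the x positions of the lead byte.
    return bytes([LEAD_MARKER[n - 1] | cp] + tail[::-1])
```

For example, utf8_encode(0xE9) gives the two bytes C3 A9, and
utf8_encode(0x20AC) gives the three bytes E2 82 AC.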

Now, a given UTF-8 character range (or a UTF-32 character range translated
into a UTF-8 character range) decomposes into a choice of sub-ranges, one for
each n-byte group it touches, and these can be combined using the r|s regular
expression construct. Therefore it suffices to treat a character range that
lies within a single k-byte group. Take, for example, the 2-byte group:
  U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
Suppose that, in the given character range, the smallest 110xxxxx lead byte
does not cover all 10xxxxxx patterns. Then one regular expression for this
group is this smallest 110xxxxx followed by a character range in the 10xxxxxx
pattern. Do the same for the largest 110xxxxx, which gives another regular
expression. Each of the remaining 110xxxxx lead bytes covers all 10xxxxxx
patterns, so they give a third regular expression: a character range in
110xxxxx followed (concatenated) by the character class of all 10xxxxxx
patterns. The 2-byte group is then the r|s choice of these three expressions.
The same method applies, byte by byte, to the higher (3- through 6-byte)
groups.
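
The decomposition described above can be sketched in Python roughly as
follows. This is only an illustration under some assumptions: the function
names (encode, split_same_length, utf8_ranges, to_regex) are hypothetical, the
\xHH output syntax is just one way to write byte classes, and the caller is
assumed to have already removed the forbidden code positions (surrogates,
U+FFFE, U+FFFF) from the range.

```python
CONT = (0x80, 0xBF)  # the class of all continuation bytes 10xxxxxx
# Largest code point in each k-byte group, k = 1..6.
GROUP_MAX = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
LEAD = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]  # lead-byte marker bits


def encode(cp):
    """Shortest-form encoding of cp as a list of byte values."""
    n = next(i for i, top in enumerate(GROUP_MAX) if cp <= top) + 1
    if n == 1:
        return [cp]
    tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 2, -1, -1)]
    return [LEAD[n - 1] | (cp >> (6 * (n - 1)))] + tail


def split_same_length(a, b):
    """Sequences of (lo, hi) byte ranges covering encodings a..b (same length)."""
    if len(a) == 1:
        return [[(a[0], b[0])]]
    first_lo, first_hi = a[0], b[0]
    if first_lo == first_hi:
        tails = split_same_length(a[1:], b[1:])
        return [[(first_lo, first_lo)] + t for t in tails]
    low, high = [], []
    # Smallest first byte: partial unless its tail starts at 80 80 ...
    if any(x != 0x80 for x in a[1:]):
        tails = split_same_length(a[1:], [0xBF] * (len(a) - 1))
        low = [[(first_lo, first_lo)] + t for t in tails]
        first_lo += 1
    # Largest first byte: partial unless its tail ends at BF BF ...
    if any(x != 0xBF for x in b[1:]):
        tails = split_same_length([0x80] * (len(b) - 1), b[1:])
        high = [[(first_hi, first_hi)] + t for t in tails]
        first_hi -= 1
    # Remaining first bytes are followed by all continuation patterns.
    middle = []
    if first_lo <= first_hi:
        middle = [[(first_lo, first_hi)] + [CONT] * (len(a) - 1)]
    return low + middle + high


def utf8_ranges(lo, hi):
    """Split the code point range lo..hi by k-byte group, then by bytes."""
    out, start = [], 0
    for top in GROUP_MAX:
        g_lo, g_hi = max(lo, start), min(hi, top)
        if g_lo <= g_hi:
            out += split_same_length(encode(g_lo), encode(g_hi))
        start = top + 1
    return out


def to_regex(seqs):
    """Render the sequences as one r|s alternation of byte classes."""
    def cls(lo, hi):
        return f"\\x{lo:02X}" if lo == hi else f"[\\x{lo:02X}-\\x{hi:02X}]"
    return "|".join("".join(cls(lo, hi) for lo, hi in s) for s in seqs)
```

For the range U+00A0 to U+017F, to_regex(utf8_ranges(0xA0, 0x17F)) yields
\xC2[\xA0-\xBF]|[\xC3-\xC5][\x80-\xBF]: the smallest lead byte with a partial
range, then the remaining lead bytes with the full 10xxxxxx class. Here the
largest lead byte C5 happens to be completely filled, so it merges into the
middle expression.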

So it seems that Flex's character class constructs can also be adapted for
use with UTF-8. The UTF-8 mode should then perhaps pretend that there are no
characters other than those in UTF-8.

  Hans Aberg
