UTF-8 Flex

From: Hans Aberg
Subject: UTF-8 Flex
Date: Sat, 08 Jan 2005 20:27:21 +0100
User-agent: Microsoft-Outlook-Express-Macintosh-Edition/5.0.6

If one assumes that Flex should be extended to UTF-8, then one can probably
assume that the .l input file is also in UTF-8 format. This simplifies the
extension problem. One should then go through the different patterns and
find suitable changes. First, the UTF-8 version should be 8-bit. But then
some patterns should be altered. (See "UTF-8 and Unicode FAQ"
for info about these encodings.)

Following the list of patterns in the manual, I can offer some immediate
suggestions:

  x match the character x
- No change needed.

  . any character (byte) except newline
- Match only ASCII characters, i.e., bytes with the highest bit 0. Then add
special symbols matching bytes that start with the bits 11 (leading byte of
a multibyte sequence) and 10 (continuation byte). Suppose, ad hoc, that
these byte ranges have the symbols ":" and ";", respectively. Then all
single-character UTF-8 sequences can be covered by:
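
The byte classes just described can be sketched as follows. This is only an
illustration in Python, not Flex syntax: the names ASCII, LEAD, CONT,
ONE_CHAR and split_chars are hypothetical, with the first three standing in
for ".", ":" and ";". The combined pattern is one loose way to match a
single character; it does not check that the number of continuation bytes
agrees with the leading byte.

```python
import re

# Hypothetical stand-ins for the three byte classes in the text.
ASCII = rb"[\x00-\x09\x0b-\x7f]"   # "." : highest bit 0, newline excluded
LEAD = rb"[\xc0-\xff]"             # ":" : bytes starting with bits 11
CONT = rb"[\x80-\xbf]"             # ";" : bytes starting with bits 10

# Loose pattern for one character: an ASCII byte, or a leading byte
# followed by one or more continuation bytes. The count is not checked.
ONE_CHAR = re.compile(ASCII + rb"|" + LEAD + CONT + rb"+")

def split_chars(data):
    """Return the byte sequences of the characters found in data.

    Bytes that match none of the classes (for example a newline, or a
    stray continuation byte) are skipped.
    """
    return ONE_CHAR.findall(data)
```

For instance, split_chars("a\u00e9\u20ac".encode("utf-8")) yields three
entries of one, two and three bytes.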

Character classes:
- One might decide to restrict these to ASCII (7-bit) characters, and invent
special ones for UTF-8.

Pattern repetitions:
  r*     zero or more r's
  r+     one or more r's
  r?     zero or one r's
  r{2,5} anywhere from two to five r's
  r{2,}  two or more r's
  r{4}   exactly 4 r's
- These seem to not need any change.
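
One point that may be worth checking here, sketched in Python with byte
patterns rather than Flex: if a multibyte literal in the .l file is passed
through as its raw bytes, a repetition operator binds to the last byte
only, so the bytes of one character would have to be grouped first. The
helper name char_star is hypothetical.

```python
import re

def char_star(ch):
    """Byte pattern for "zero or more ch", grouping the UTF-8 bytes of ch."""
    return b"(?:" + re.escape(ch.encode("utf-8")) + b")*"

text = "\u00e9\u00e9".encode("utf-8")   # two copies of U+00E9: c3 a9 c3 a9

# Naive byte-level reading of "r*": the star binds to the last byte (a9) only.
naive = re.escape("\u00e9".encode("utf-8")) + b"*"
naive_ok = re.fullmatch(naive, text) is not None

# Grouped reading: the star applies to the whole two-byte sequence.
grouped_ok = re.fullmatch(char_star("\u00e9"), text) is not None
```

Here naive_ok is False and grouped_ok is True.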

  {name} the expansion of the "name" definition
- Identifiers "name" might be restricted to ASCII names.

  "[xyz]\"foo" the literal string: [xyz]"foo
- No change needed.

  \x  escape sequence
  \0 a NULL character
- No changes needed.

  \123  the character with octal value 123
- This probably need not be changed, because octal numbers will probably not
be used to indicate higher Unicode numbers.

  \x2a  the character with hexadecimal value 2a
- This should probably be changed so that one can add arbitrary numbers. One
might add two constructs: \x......., which can expand to any Unicode number
and is converted into UTF-8, and \u........, which also checks whether the
hexadecimal number is in the valid Unicode range. Alternatively, one might
keep \x.. to indicate only 1-byte hexadecimal numbers, and let \u........
denote any 31-bit number, leaving the valid Unicode number check to the
user. In any case, one should not be able to use them in [...] character
ranges, if the latter are restricted to ASCII ranges.
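
The conversion such an escape would need can be sketched as below, in
Python for illustration. The function names encode_utf8 and
is_valid_unicode are hypothetical. The sketch assumes the original UTF-8
scheme, which covers 31-bit numbers with sequences of up to six bytes, and
keeps the validity check separate (at most U+10FFFF, no surrogates), so
that either of the two designs above could be built from it.

```python
def is_valid_unicode(cp):
    """True for code points in 0..0x10FFFF that are not surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def encode_utf8(cp):
    """Encode a number of up to 31 bits with the original UTF-8 scheme
    (sequences of up to six bytes); no Unicode range check is done."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("number does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])                      # single ASCII byte
    # (upper limit, leading-byte marker) for 2- to 6-byte sequences
    for nbytes, (limit, marker) in enumerate(
            [(0x800, 0xC0), (0x10000, 0xE0), (0x200000, 0xF0),
             (0x4000000, 0xF8), (0x80000000, 0xFC)], start=2):
        if cp < limit:
            break
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (cp & 0x3F))         # continuation byte: 10xxxxxx
        cp >>= 6
    return bytes([marker | cp] + tail[::-1])
```

With this split, a checked escape would call is_valid_unicode before
encode_utf8, while an unchecked one would call encode_utf8 directly.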

  (r)  parentheses
  rs   r followed by s
  r|s  r or s
  r/s  r only when followed by s
  ^r   r only at the beginning of a line
  r$   r but only at the end of a line
- No changes seem to be needed.

So it seems that if one skips over the problem of UTF-8 character classes, a
tentative UTF-8 mode might be designed fairly quickly.

  Hans Aberg
