
RE: Flex and 32-bits characters


From: Mark Weaver
Subject: RE: Flex and 32-bits characters
Date: Sat, 24 Aug 2002 13:42:24 +0100

My patch is basically an integration of the existing unicode patch and the
re-entrant patch to flex, which is now of course incorporated into the
mainline code.

UCS-4 is a 4-byte representation of ISO 10646-1, which is essentially equal
to Unicode.  There are some historical differences between ISO 10646-1 and
Unicode, but they are (basically) the same thing.  Unicode is a 21-bit
character set, and it guarantees (presumably by having plenty of spare code
points) never to expand beyond that range.  There are then 5 basic ways of
representing the Unicode character set:

UTF-8, which uses 8-bit code units and stores a code point in 1-4 bytes.
UTF-16 (little- or big-endian), which uses 16-bit integers and stores a
code point in 1 or 2 integers.
UTF-32 (LE/BE), which uses 32-bit integers and stores a code point in
exactly 1 integer.

Most processing schemes (e.g. ICU, Windows) choose to use UTF-16
internally, as it provides the nicest trade-off between memory and
performance.

The patch just defines YY_CHAR as wchar_t, which (cough) causes something
of an explosion in the table size, as you would predict.  Flex spits out
its handy equivalence-class table, which it indexes directly by character
to produce an int.  With a plain old char this table is a modest 1K; with
a 16-bit wchar_t it's 256K; and if your platform defines wchar_t as
32 bits (which some do), well...

The main questions to be addressed as I see it are:

1) What representation of unicode are you expecting to lex?  Are we
expecting flex to deal with this?  Here I would pick one (say UTF-16 in
the platform's default endianness) and leave it to the caller to do the
relatively trivial conversion on the input stream.  I don't see it as a
problem that some characters are represented by pairs of 16-bit integers;
the scanner would just treat these as two separate characters to match,
with the correct result ensuing.  The same applies to UTF-8.  In fact,
accepting UTF-16 as input is simply a speed optimisation (at the expense
of memory): you save the scanner dealing with single bytes when you know
that you never really have any - hence the DFA is smaller/faster.

2) Flex itself needs to accept unicode input for this to be truly useful.
As it stands, it arbitrarily widens characters, which works for ASCII but
is unhelpful for anything else.  If you want to scan anything outside the
ASCII range then you'd need flex to accept unicode (in some form) as
input.  (Of course, flex scans its input with a scanner it generates
itself, so this wouldn't be too hard once you'd found an acceptable way
of doing so.)

3) Some way of solving the memory explosion that results from having
large equivalence-class (ec) tables.  More details on your NFA->DFA
algorithm?  Is this what was being discussed earlier, when the great
Mr Paxson popped in?  (I was keeping a lazy eye on it.)

I'm quite willing to help out with implementing a sensible solution to this
btw.

Mark

> -----Original Message-----
> From: address@hidden [mailto:address@hidden Behalf
> Of Hans Aberg
> Sent: 24 August 2002 10:33
> To: Antoine Fink
> Cc: address@hidden
> Subject: Re: Flex and 32-bits characters
>
>
> At 16:20 -0400 2002/08/23, Antoine Fink wrote:
> >I am currently working on a regular expression parser, using Flex and
> >Yacc, and I want to be able to parse either ASCII, UTF-8 or UCS-4
> >strings.
>
> What is UCS-4; same as UTF-n, n = 24-32?
>
> >The general idea was to convert anything to ucs-4 (using 32 bits chars),
> >parse the regex, then re-convert (whenever possible) to the specified
> >matching encoding. (That part was already done some time ago when we used
> >our own C parsing program instead of Flex & Yacc, so this is not really
> >the issue).
>
> First note that I am not a Flex developer. But the issue was
> discussed here
> and in the Bison lists quite a bit some time ago:
>
> I think what you mention is the major candidate for implementing Unicode
> onto Flex: Hook up code converters (like C++ std::codecvt), so that
> internally Flex only sees say UTF-32. This is the only way to handle the
> many different possible encodings, plus the problem of variable width
> characters.
>
> The problem is really that Flex uses static tables indexed on the
> character type.  Thus, with a 32-bit character type you end up in
> principle with a 2^32-entry table, unless you somehow cut it down
> using some form of table compression.
>
> One interesting alternative might be to make Flex produce a very compact
> NFA machine table, which is converted to DFA states and cached as needed.
>
> >The problem is that I am unable to make Flex read in 32-bit characters
> >(in an easy fashion... say, typedef'ing chars to 32-bit integers, or
> >re-#define'ing chars as 32-bit integers, but that won't work at all,
> >for numerous reasons.)
>
> There is a "Unicode Flex" on the Internet:
>     Unicode Flex: ftp://ftp.lauton.com/pub/flex-2.5.4-unicode-patch.tar.gz
> In reality, though, I think that it only changes char to wchar_t, and
> assumes that the latter type is 16 bits.
> If you want to experiment with current Flex, it is at:
>     Flex Beta (2.5.15): ftp://ftp.uncg.edu/people/wlestes/
>
> >After reading lots of posts on the web, I have found some ways to
> >accomplish this 32-bit character lexing, and that which makes the most
> >sense to me is to (locally) modify Flex's own source code and make it
> >generate lexers that use 32-bits integers instead of chars.
>
> So, as said above, this is the obvious way to go, but one then hits the
> problem of large tables.
>
>   Hans Aberg
>
>
>
>
> _______________________________________________
> Help-flex mailing list
> address@hidden
> http://mail.gnu.org/mailman/listinfo/help-flex
>




