[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Using libunistring for string comparisons et al

From: Mark H Weaver
Subject: Re: Using libunistring for string comparisons et al
Date: Fri, 18 Mar 2011 08:05:01 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)

Thien-Thi Nguyen <address@hidden> writes:

> () Mark H Weaver <address@hidden>
> () Thu, 17 Mar 2011 21:38:28 -0400
>    If we may assume that the searched string is valid UTF-8, and when only
>    ASCII characters are excluded (e.g. "."), then three additional states
>    are required in the generated DFA.  Let us call them S1, S2, and S3.
>    [handling these states]
>    When non-ASCII characters are excluded, additional states must be added:
>    one for each unique prefix of the excluded multibyte characters.  It's
>    quite straightforward.
> I don't understand what "excluded" means here.  Is this a property
> of the string, the regexp, the (dynamic, environmental) operation, or ...?

The excluded characters are the ones listed inside of a [^...] form.
For example, [^abc] excludes only the ASCII characters #\a, #\b, and
#\c.  [^abcλ] excludes also the non-ASCII character #\λ.

"." is equivalent to a [^...] form that excludes only newlines.

I should also note that although UTF-8 causes a few complications when
compiling regexps, UTF-32 creates much worse problems that require a
different DFA data structure and a slower scan.

In the UTF-8 case, each state in the DFA can contain a fixed lookup
table with 256 entries (one for each byte) as is used for ASCII or
Latin1.  However, in the UTF-32 case, it is _not_ practical for each
state to contain a lookup table with over 1 million entries (one for
each code point).


reply via email to

[Prev in Thread] Current Thread [Next in Thread]