Re: Case mapping of sharp s

From: Ulrich Mueller
Subject: Re: Case mapping of sharp s
Date: Fri, 20 Nov 2009 09:10:29 +0100

>>>>> On Thu, 19 Nov 2009, David Kastrup wrote:

>> I can guess why it's much slower going backward: the simple search
>> operates on chars rather than bytes. The internal encoding we use
>> (currently based on utf-8) is designed to be easy to parse going
>> forward but not so easy going backward (IIRC our encoding is
>> actually even a bit more painful in this case than pure utf-8).

> I don't think so. The utf-8 _scheme_ can be used to encode 21bits in
> 4 characters.

The original UTF-8 (specified in RFC 2279) could encode the full range
of 2^31 characters in up to 6 bytes. The limitation to about 2^20.1
code points (0..10FFFF, in at most 4 bytes) came later, with RFC 3629,
and is artificial.
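
The arithmetic behind those figures can be sketched as follows. This only
illustrates the RFC 2279 bit layout; it is not code from Emacs, and the
function name utf8_payload_bits is made up for the example.

```c
/* Payload bits carried by an n-byte sequence in the original
   (RFC 2279) UTF-8 scheme.  Hypothetical helper, for illustration only.
   A 1-byte sequence is 0xxxxxxx (7 bits).  For n >= 2 the lead byte has
   n one-bits followed by a zero, leaving 7 - n payload bits, and each
   of the n - 1 trailing bytes (10xxxxxx) adds 6 more.  */
int
utf8_payload_bits (int n)
{
  if (n < 1 || n > 6)
    return -1;                  /* not a valid sequence length */
  if (n == 1)
    return 7;
  return (7 - n) + 6 * (n - 1);
}
```

For 1 to 6 bytes this gives 7, 11, 16, 21, 26 and 31 bits. The 0x110000
code points of today's Unicode need 21 bits, hence at most 4 bytes.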

> We stay within that range, in the utf-8 4 character scheme, but
> outside of the Unicode range 2^20+2^16.

character.h says it's up to 22 bits encoded in up to 5 bytes:

|    character code     1st byte   byte sequence
|    --------------     --------   -------------
|         0-7F          00..7F     0xxxxxxx
|        80-7FF         C2..DF     110xxxxx 10xxxxxx
|       800-FFFF        E0..EF     1110xxxx 10xxxxxx 10xxxxxx
|     10000-1FFFFF      F0..F7     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
|    200000-3FFF7F      F8         11111000 1000xxxx 10xxxxxx 10xxxxxx 10xxxxxx
|    3FFF80-3FFFFF      C0..C1     1100000x 10xxxxxx (for eight-bit-char)
|    400000-...         invalid
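
Read as code, the table amounts to something like the sketch below. It is
a plain transcription of the ranges listed above, not the macros that
character.h actually defines; the name char_bytes_sketch is hypothetical.

```c
/* Number of bytes the table above assigns to character code c,
   or 0 if c is outside the valid range.  Hypothetical helper that
   only mirrors the table; not taken from character.h.  */
int
char_bytes_sketch (unsigned int c)
{
  if (c <= 0x7F)
    return 1;
  if (c <= 0x7FF)
    return 2;
  if (c <= 0xFFFF)
    return 3;
  if (c <= 0x1FFFFF)
    return 4;
  if (c <= 0x3FFF7F)
    return 5;
  if (c <= 0x3FFFFF)
    return 2;                   /* eight-bit-char, lead byte C0..C1 */
  return 0;                     /* 400000 and above: invalid */
}
```

The 5-byte form carries only 22 bits because its lead byte is fixed at F8
and the second byte is restricted to 1000xxxx (4 + 3 * 6 = 22).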

>> BM on the other hand works on bytes, so there's no such slowdown.

> With utf-8, I think that apart from character ranges, search forward and
> backward should work perfectly like on 8-bit characters.  Exception is
> incomplete character matches, but since the utf-8 scheme can immediately
> tell "is a 7-bit character" "is the first character of a multibyte
> sequence of length n" "is last or intermediate character of multibyte
> sequence" this is not a serious problem.

When the search is for equivalence classes of characters (e.g. case
folding), then I think it must operate on whole characters and
therefore has to find the start of each multibyte sequence.
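
Finding the start of a multibyte sequence when going backward could look
roughly like this. It is a minimal sketch that assumes a well-formed
buffer in the encoding from the table above, where every trailing byte
has the form 10xxxxxx; the function names are invented for the example
and are not the ones Emacs uses.

```c
#include <stddef.h>

/* Nonzero if b is a trailing byte (10xxxxxx) of a multibyte sequence.  */
int
is_trailing_byte (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Index of the first byte of the character that contains buf[pos].
   Assumes buf holds well-formed sequences and pos is a valid index.  */
size_t
char_start (const unsigned char *buf, size_t pos)
{
  while (pos > 0 && is_trailing_byte (buf[pos]))
    pos--;
  return pos;
}

/* Start of the character just before the one starting at pos,
   i.e. one step backward by a whole character.  Requires pos > 0.  */
size_t
prev_char_start (const unsigned char *buf, size_t pos)
{
  return char_start (buf, pos - 1);
}
```

For the bytes 'a' C3 9F 'b' (C3 9F being the sharp s, U+00DF),
prev_char_start (buf, 3) skips the trailing byte 9F and returns 1. Each
backward step therefore costs up to four extra byte tests, whereas going
forward the lead byte gives the length directly.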
