Re: [PATCH] unistr/u8-strchr: speed up searching for ASCII characters

From: Pádraig Brady
Subject: Re: [PATCH] unistr/u8-strchr: speed up searching for ASCII characters
Date: Mon, 12 Jul 2010 00:38:57 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20100227 Thunderbird/3.0.3

On 11/07/10 15:20, Paolo Bonzini wrote:
> On 07/07/2010 03:44 PM, Pádraig Brady wrote:
>> Subject: [PATCH] unistr/u8-strchr: speed up searching for ASCII
>> characters
>> * lib/unistr/u8-strchr.c (u8_strchr): Use strchr() for
>> the single byte case as it was measured to be 50% faster
>> than the existing code on x86 linux.  Also add a comment
>> on why not to use memmem() for the moment for the multibyte case.
> If p is surely a valid UTF-8 string, you can do better in general like
> this.  Say [q, q+q_len) points to an UTF-8 representation of uc:
>   for (; (p = strchr (p, *q)) && memcmp (p+1, q+1, q_len-1); p += q_len)
>     ;
>   return p;

That would be an improvement if strchr() skipped lots of p at a time,
enough to offset the function call overhead. However, many characters share
the same first byte in their multibyte UTF-8 encoding, so I'm guessing there
would be lots of false positives in practice?
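
For reference, here is a minimal sketch of the suggested loop written out as
a standalone function. The name find_utf8_seq and its signature are made up
for illustration and are not gnulib code. It assumes s is NUL-terminated
valid UTF-8 and q_len >= 1. Each strchr() hit on a shared lead byte that
turns out to be a different character costs one memcmp() plus another
strchr() call, which is the false positive case in question.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of gnulib: find the UTF-8 sequence
   [q, q + q_len) in the NUL-terminated string s.
   ASSUMES s is valid UTF-8 and q_len >= 1; on invalid input the
   memcmp below may read past the end of s. */
const uint8_t *
find_utf8_seq (const uint8_t *s, const uint8_t *q, size_t q_len)
{
  const char *p = (const char *) s;

  /* Jump to the next occurrence of the lead byte. */
  while ((p = strchr (p, q[0])) != NULL)
    {
      /* Lead byte matched, so this character is q_len bytes long;
         compare the remaining q_len - 1 bytes. */
      if (memcmp (p + 1, q + 1, q_len - 1) == 0)
        return (const uint8_t *) p;
      /* False positive: skip the whole character. */
      p += q_len;
    }
  return NULL;
}
```

Searching "a ï b é c" for é (0xC3 0xA9) hits the lead byte 0xC3 of ï first,
rejects it in memcmp(), and only then finds the match.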

> That's because once the first byte has matched, the length of the UTF-8
> character is known to be q_len.  It's better than memmem if the startup
> cost of strchr is low enough (of course memcmp has to be
> inlined/unrolled/unswitched to get decent performance).
> Does the argument of u8_strchr have this guarantee?  If not, the above
> code can read arbitrary memory.

I was wondering myself about what parts of gnulib/unistring could take
advantage of assuming valid UTF-8 strings. From my own notes on this
function, I have:

"Some possible optimizations would need to
be conditional on CONFIG_UNICODE_SAFETY (see u8_mblen).
Note also u8_mbtouc_unsafe() and u8_mbtouc(), the latter
detecting invalid UTF-8 chars even without --enable-safety.
So given the above, I'm assuming that most of gnulib/unistring
assumes valid UTF-8 (which users can enforce on input with u8_check()),
and if a safe but inefficient implementation option is possible
then it should be within CONFIG_UNICODE_SAFETY. Note I found
no mention of --enable-safety in the gnulib/libunistring configure scripts."
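
As a rough illustration of what a "safe but inefficient" variant could look
like, here is a sketch that makes no assumption about the validity of the
input. The name find_utf8_seq_safe is hypothetical and this is not gnulib
code; whether such a variant would actually be selected under
CONFIG_UNICODE_SAFETY is also only an assumption. It uses strncmp(), which
stops at the terminating NUL, and advances one byte at a time after a
mismatch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of gnulib: find the byte sequence
   [q, q + q_len) in the NUL-terminated string s without assuming
   that s is valid UTF-8.  Requires q_len >= 1. */
const uint8_t *
find_utf8_seq_safe (const uint8_t *s, const uint8_t *q, size_t q_len)
{
  const char *p = (const char *) s;

  while ((p = strchr (p, q[0])) != NULL)
    {
      /* strncmp stops at the NUL terminating p, so a truncated
         sequence at the end of s is never read past. */
      if (strncmp (p, (const char *) q, q_len) == 0)
        return (const uint8_t *) p;
      /* Cannot trust the lead byte's implied length; step one byte. */
      p++;
    }
  return NULL;
}
```

The cost is that a mismatch only advances by one byte instead of q_len, so
this does more strchr() calls on text with many shared lead bytes.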

