
Re: Case mapping of sharp s

From: Kenichi Handa
Subject: Re: Case mapping of sharp s
Date: Wed, 25 Nov 2009 11:13:23 +0900

In article <address@hidden>, grischka <address@hidden> writes:

> DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH
> are (as with Andy's patch).  Moderately slower, still ;)

So, changing the current backward matching to forward
matching should be effective.

> The originally observed slowness was not because of the usage of
> CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as
> using unrelated "best_below" and "best_above" in the same expression.

> For the numbers, with my 100MB file test case:

> backward search previously:
>       14 .. 90 s (random)
> backward search with fixed CHAR_TO_BYTE:
>       5.6 s

I don't see any fix of CHAR_TO_BYTE in the current CVS
code.  Where is it?

> > (2) Pre-compute the character codes in PAT in an integer
> >     array if LEN is not that long (perhaps LEN < 256, or
> >     at most, sizeof (int) * LEN < MAX_ALLOCA).
> > 
> > Then, you don't need the repeated STRING_CHAR on PAT.  This
> > can be applicable to forward search too.
> >

> In practice searching a string is mostly about searching the first
> char in the string, basically like strchr(buf, pat[0]).  (That is
> unless you'd search for "aabb" in "aaabaaaaaaababbaaaabb" which is
> not a practical example.)

> So as to pre-computing the pattern, you'd get the most improvement
> already from just pre-computing "pat[0]" or "pat[len-1]" if you
> want to.
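The strchr-like first-char scan described above can be sketched as
follows.  This is a deliberately simplified single-byte version: real
Emacs search must decode multibyte characters and apply TRANSLATE, and
the function name is illustrative, not from the Emacs source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: find PAT in BUF by scanning for pat[0]
   (as strchr does) and only comparing the rest of the pattern
   at positions where the first character matches.  */
static const char *
naive_first_char_search (const char *buf, const char *pat)
{
  size_t patlen = strlen (pat);
  if (patlen == 0)
    return buf;
  for (const char *p = buf; (p = strchr (p, pat[0])) != NULL; p++)
    if (strncmp (p, pat, patlen) == 0)
      return p;                 /* full match starts at p */
  return NULL;                  /* no match */
}
```

Because the inner strncmp runs only where pat[0] already matched, most
buffer positions are rejected by the cheap strchr scan alone, which is
grischka's point about pre-computing just pat[0].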

When there's no match, it's true that pre-computing the
whole pattern is wasted work.  But when there is a match, we
compute the character codes of the whole pattern anyway.  So
perhaps a good strategy is to record the computed character
codes gradually, as we need them.
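The "record gradually" idea might look like the sketch below.  All
names here are illustrative; decode_char is a stand-in for STRING_CHAR
that pretends characters are single bytes, and decode_count exists
only to make the caching behavior observable.

```c
#include <assert.h>
#include <limits.h>

/* Lazily cache pattern character codes: each position is decoded at
   most once, even across many partial match attempts.  */
#define UNDECODED INT_MIN
enum { MAX_PAT = 256 };

static int pat_codes[MAX_PAT];  /* cached codes; UNDECODED if unset */
static int decode_count;        /* number of real decode operations */

static int
decode_char (const unsigned char *pat, int i)  /* stand-in for STRING_CHAR */
{
  decode_count++;
  return pat[i];                /* pretend single-byte characters */
}

static int
pat_char (const unsigned char *pat, int i)
{
  if (pat_codes[i] == UNDECODED)
    pat_codes[i] = decode_char (pat, i);
  return pat_codes[i];
}

static void
init_pat_cache (int len)
{
  for (int i = 0; i < len; i++)
    pat_codes[i] = UNDECODED;
  decode_count = 0;
}
```

On a mismatch at pattern position k, only positions 0..k have been
decoded, so the no-match case pays almost nothing, while the match
case still decodes each position exactly once.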

> > (3) In addition to (2), pre-compute the character codes in
> >     BUF too in an array of the same length as (2).
> > 
> > Then you can avoid using STRING_CHAR and TRANSLATE
> > repeatedly on the same place of BUF.  This requires modulo
> > calculation to get an index of the array, but I think it's
> > faster than the combination of STRING_CHAR and TRANSLATE.

> Because the first char matches rarely (on average), a repeated
> translation of the same place happens rarely too.

> Of course, TRANSLATE (-> Fassoc(trt, make_number())) per se is
> slow,  so a translation table as C array for say the 0..127
> range, would help indeed.

> In any case, with some tweaking it is possible to improve both
> directions by ~70% (that is down to about 1 sec for the test
> case).  I still don't know why boyer_moore with a one-char
> pattern takes only 0.5 seconds though.  It's amazingly fast.

Are you comparing both methods with the same value of

> Btw it seems that long loading time for the big file has much to
> do with inefficient counting of newlines.  Apparently it takes
> ~2 sec to load the file and then another ~6 sec to scan newlines.
> It should be (far) under 0.5 sec.

Why is the code of counting newlines called when we just
visit a file?

Kenichi Handa
