Re: Case mapping of sharp s

From: grischka
Subject: Re: Case mapping of sharp s
Date: Tue, 24 Nov 2009 20:23:43 +0100
User-agent: Thunderbird (Windows/20090812)

Kenichi Handa wrote:
In article <address@hidden>, Andreas Schwab <address@hidden> writes:

Ulrich Mueller <address@hidden> writes:
Is that the reason for the backwards search in ERC buffers being
extremely slow? It may keep Emacs busy for several *minutes*. And it's
not interruptible with C-g.

Does this patch help?

Here are some ideas to improve it.

(1) Do forward matching in backward search.

The original code roughly does this to search "012abc"
for "012" from the tail.
   check if "012" matches with "abc"
   check if "012" matches with "2ab"

But the new code does this:
   check if "210" matches with "cba"
   check if "210" matches with "ba2"

As INC_BOTH is faster than DEC_BOTH, the original way of
checking for a match is faster.

A single DEC_BOTH is maybe not slower than INC_BOTH, but two DEC_BOTH
per step are (as with Andy's patch).  Only moderately slower, though ;)

The slowness of the original code
was caused by using CHAR_TO_BYTE to find the place of "2"
when you know the place of "a".  Use DEC_BOTH here only.

The originally observed slowness was not because of the usage of
CHAR_TO_BYTE, but because of the flaws in CHAR_TO_BYTE, such as
using unrelated "best_below" and "best_above" in the same expression.

For the numbers, with my 100MB file test case:

backward search previously:
        14 .. 90 s (random)
backward search with fixed CHAR_TO_BYTE:
        5.6 s
backward search without CHAR_TO_BYTE (Andy's patch):
        4.1 s
forward search:
        3.6 s
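
To make idea (1) concrete, here is a minimal sketch of a backward search that still compares each candidate position left to right. It is not the Emacs code: it works on plain bytes, so there is no INC_BOTH/DEC_BOTH or CHAR_TO_BYTE involved, and the name search_backward and its signature are made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper: return the start offset of the last occurrence
   of pat[0..len) in buf[0..n), or -1 if there is none.  */
long
search_backward (const char *buf, long n, const char *pat, long len)
{
  assert (len > 0);

  /* Candidate start positions are visited from the tail to the head...  */
  for (long start = n - len; start >= 0; start--)
    {
      long j = 0;

      /* ...but each candidate is still matched in forward direction.  */
      while (j < len && buf[start + j] == pat[j])
        j++;
      if (j == len)
        return start;
    }
  return -1;
}
```

In a multibyte buffer the only backward step needed is the one that moves `start` to the previous character; the inner comparison can keep stepping forward.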

(2) Pre-compute the character codes in PAT in an integer
    array if LEN is not that long (perhaps LEN < 256, or
    at most, sizeof (int) * LEN < MAX_ALLOCA).

Then, you don't need the repeated STRING_CHAR on PAT.  This
can be applicable to forward search too.

In practice searching a string is mostly about searching the first
char in the string, basically like strchr(buf, pat[0]).  (That is,
unless you search for "aabb" in "aaabaaaaaaababbaaaabb", which is
not a practical example.)

So as for pre-computing the pattern, you'd already get most of the
improvement from just pre-computing "pat[0]", or "pat[len-1]" if you
want to.
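
A rough sketch of what "mostly searching the first char" means, again on plain bytes and without any translation. The function name search_first_char is hypothetical; memchr and memcmp are the standard C library calls.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return the start offset of the first occurrence
   of pat[0..len) in buf[0..n), or -1 if there is none.  */
long
search_first_char (const char *buf, size_t n, const char *pat, size_t len)
{
  assert (len > 0);

  const char first = pat[0];    /* pre-computed once */
  const char *p = buf;
  const char *end = buf + n;

  while ((size_t) (end - p) >= len)
    {
      /* Scan only those positions where a full match still fits.  */
      p = memchr (p, first, (size_t) (end - p) - len + 1);
      if (p == NULL)
        return -1;

      /* The rest of the pattern is compared only after a first-char hit.  */
      if (memcmp (p + 1, pat + 1, len - 1) == 0)
        return (long) (p - buf);
      p++;
    }
  return -1;
}
```

Almost all of the time goes into the memchr scan, which is why pre-computing more than the first character buys little for typical patterns.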

(3) In addition to (2), pre-compute the character codes in
    BUF too in an array of the same length as (2).

Then you can avoid using STRING_CHAR and TRANSLATE
repeatedly on the same place of BUF.  This requires modulo
calculation to get an index of the array, but I think it's
faster than the combination of STRING_CHAR and TRANSLATE.
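
For illustration, ideas (2) and (3) together could look roughly like the sketch below. It is only a sketch under assumptions: fold stands in for STRING_CHAR plus TRANSLATE (here plain ASCII case folding), MAX_PAT and search_cached are made-up names, and the buffer is treated as single bytes.

```c
#include <assert.h>

#define MAX_PAT 256             /* assumed upper bound on pattern length */

/* Stand-in for STRING_CHAR + TRANSLATE: ASCII case folding only.  */
static int
fold (int c)
{
  return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Hypothetical helper: case-insensitive forward search that translates
   every buffer position at most once.  Returns the start offset of the
   first match, or -1.  */
long
search_cached (const char *buf, long n, const char *pat, long len)
{
  int patc[MAX_PAT];            /* (2): pre-computed pattern codes */
  int cache[MAX_PAT];           /* (3): circular cache of buffer codes */
  long filled = 0;              /* buf[0..filled) has been translated */

  assert (len > 0 && len <= MAX_PAT);

  for (long j = 0; j < len; j++)
    patc[j] = fold ((unsigned char) pat[j]);

  for (long i = 0; i + len <= n; i++)
    {
      long j;

      for (j = 0; j < len; j++)
        {
          long pos = i + j;

          /* Translate a position the first time it is reached; the
             modulo maps it to its slot in the circular cache.  */
          if (pos == filled)
            {
              cache[pos % len] = fold ((unsigned char) buf[pos]);
              filled++;
            }
          if (cache[pos % len] != patc[j])
            break;
        }
      if (j == len)
        return i;
    }
  return -1;
}
```

A window never looks more than len - 1 positions ahead of its start, so a cache of len entries is enough and a slot is never overwritten while it can still be read.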

Because the first char matches rarely (on average), a repeated
translation of the same place happens rarely too.

Of course, TRANSLATE (-> Fassoc(trt, make_number())) per se is
slow, so a translation table as a C array for, say, the 0..127
range would indeed help.
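
Such a table could be set up along these lines. This is a sketch, not a patch: slow_translate is a stand-in for the general lookup (here just ASCII case folding), and fast_trt, init_fast_trt and translate_char are hypothetical names.

```c
#include <assert.h>

#define FAST_TRT_SIZE 128       /* covers the 0..127 range */

static int fast_trt[FAST_TRT_SIZE];

/* Stand-in for the expensive general lookup.  */
static int
slow_translate (int c)
{
  return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Fill the array once, e.g. whenever the translation table changes.  */
void
init_fast_trt (void)
{
  for (int c = 0; c < FAST_TRT_SIZE; c++)
    fast_trt[c] = slow_translate (c);
}

/* Fast path for 0..127, general lookup for everything else.  */
int
translate_char (int c)
{
  assert (c >= 0);
  return c < FAST_TRT_SIZE ? fast_trt[c] : slow_translate (c);
}
```

The array has to be refilled whenever the underlying translation table changes, otherwise the fast path returns stale results.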

In any case, with some tweaking it is possible to improve both
directions by ~70% (that is down to about 1 sec for the test
case).  I still don't know why boyer_moore with a one-char
pattern takes only 0.5 seconds though.  It's amazingly fast.

Btw, it seems that the long loading time for the big file has much to
do with inefficient counting of newlines.  Apparently it takes
~2 sec to load the file and then another ~6 sec to scan for newlines.
It should be (far) under 0.5 sec.

--- grischka
