Grep with UTF8 is slow

From: A. Hjortland
Subject: Grep with UTF8 is slow
Date: 16 Sep 2003 22:19:13 +0200

This is a resend of a mail I sent 2003-04-24, to which I got no reply.

Grep (at least 2.5 and 2.5.1) is *very* slow in cases where most input
lines match and the charset is UTF8.

Look at the second case (3 seconds!):
> export LC_ALL=no_NO.UTF8
> time find /usr/bin | grep -c zzz

real    0m0.035s
user    0m0.000s
sys     0m0.030s
> time find /usr/bin | grep -c bin

real    0m3.364s
user    0m3.320s
sys     0m0.000s
> export LC_ALL=C
> time find /usr/bin | grep -c zzz

real    0m0.021s
user    0m0.000s
sys     0m0.010s
> time find /usr/bin | grep -c bin

real    0m0.021s
user    0m0.010s
sys     0m0.010s

We tracked the problem down to EGexecute in search.c.
As I understand it, the function scans a buffer and returns _one_ match
from the buffer each time it's called. The problem is: for each call to
EGexecute, check_multibyte_string (also in search.c) is called once, on
the *entire buffer*. If all N lines match, and the buffer contains, say,
1000 lines, that means check_multibyte_string will have to process
N*1000 lines, not just N lines. Hence the low performance.
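
To make the cost concrete, here is a small model of the call pattern
described above. It is not the grep code: classify_whole_buffer and
find_next_match are hypothetical stand-ins for check_multibyte_string
and EGexecute, and the "classification" only counts the bytes it would
have to look at.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Total bytes the stand-in classifier has been asked to process. */
static size_t bytes_classified;

/* Hypothetical stand-in for check_multibyte_string: it does no real
   decoding, it only records how much work it would have to do. */
static void classify_whole_buffer(const char *buf, size_t size)
{
    (void) buf;
    bytes_classified += size;
}

/* Hypothetical stand-in for EGexecute: classify the entire buffer,
   then return the offset of the first line at or after `start` that
   contains `pat`, or (size_t) -1 if no such line exists. */
static size_t find_next_match(const char *buf, size_t size,
                              size_t start, const char *pat)
{
    assert(buf != NULL && pat != NULL);
    classify_whole_buffer(buf, size);   /* paid again on every call */

    size_t plen = strlen(pat);
    while (start < size) {
        const char *eol = memchr(buf + start, '\n', size - start);
        size_t len = eol ? (size_t) (eol - (buf + start)) : size - start;
        for (size_t i = 0; i + plen <= len; i++)
            if (memcmp(buf + start + i, pat, plen) == 0)
                return start;
        start += len + 1;               /* skip the line and its newline */
    }
    return (size_t) -1;
}

/* Count matching lines the way the caller does: one call per match.
   Resets bytes_classified so it reflects this run only. */
size_t count_matches(const char *buf, size_t size, const char *pat)
{
    size_t count = 0, pos = 0;
    bytes_classified = 0;
    while (pos < size) {
        size_t m = find_next_match(buf, size, pos, pat);
        if (m == (size_t) -1)
            break;
        count++;
        const char *eol = memchr(buf + m, '\n', size - m);
        pos = eol ? (size_t) (eol - buf) + 1 : size;
    }
    return count;
}
```

With a 33-byte buffer of three lines that all contain "bin", this model
classifies 99 bytes (three passes over the buffer); searching the same
buffer for "zzz" classifies only 33 bytes (one pass).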

Patch suggestion attached.
Warning: This is just something I whipped together rather haphazardly.
No guarantees here :)

How the patch works:
Instead of parsing the _entire buffer at once_ with
check_multibyte_string, parse it _incrementally_ in chunks of 100 bytes,
as far as needed.
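
The idea can be sketched roughly as follows. This is not the attached
patch: struct mb_scan and its functions are hypothetical names, and the
per-byte test is a simplification that assumes UTF-8, where
continuation bytes always have the form 10xxxxxx. A locale-based
decoder such as mbrlen would also have to carry its conversion state
across chunk boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MB_CHUNK 100  /* bytes classified per step, as in the description */

/* Hypothetical incremental scanner state. */
struct mb_scan {
    const unsigned char *buf;
    size_t size;
    unsigned char *props;  /* caller-provided, `size` bytes; 1 = char start */
    size_t done;           /* bytes [0, done) are already classified */
};

static void mb_scan_init(struct mb_scan *s, const char *buf, size_t size,
                         unsigned char *props)
{
    assert(s != NULL && buf != NULL && props != NULL);
    s->buf = (const unsigned char *) buf;
    s->size = size;
    s->props = props;
    s->done = 0;
    memset(props, 0, size);
}

/* Make sure bytes [0, upto) are classified, extending the classified
   prefix in MB_CHUNK-sized steps and never past the end of the buffer. */
static void mb_scan_ensure(struct mb_scan *s, size_t upto)
{
    if (upto > s->size)
        upto = s->size;
    while (s->done < upto) {
        size_t end = s->done + MB_CHUNK;
        if (end > s->size)
            end = s->size;
        for (size_t i = s->done; i < end; i++)
            /* UTF-8 assumption: 10xxxxxx is a continuation byte. */
            s->props[i] = (s->buf[i] & 0xC0) == 0x80 ? 0 : 1;
        s->done = end;
    }
}

/* Return 1 if `off` is the first byte of a character, classifying only
   as much of the buffer as is needed to answer. */
static int mb_is_char_start(struct mb_scan *s, size_t off)
{
    if (off >= s->size)
        return 0;
    mb_scan_ensure(s, off + 1);
    return s->props[off];
}
```

A match found near the start of the buffer then costs one chunk of
classification rather than a pass over the whole buffer, and later
queries reuse whatever has already been classified.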

Obviously, the other execute-functions must be patched too.

--Håkon A. Hjortland

Attachment: grep-2.5_search.c_utf8speed.diff
Description: Text document