Grep with UTF8 is slow

bug-gnu-utils


From:	A. Hjortland
Subject:	Grep with UTF8 is slow
Date:	16 Sep 2003 22:19:13 +0200

This is a resend of a mail I sent 2003-04-24, which I got no reply to.



Grep (at least 2.5 and 2.5.1) is *very* slow in cases where most input
lines match and the charset is UTF8.

Look at the second case (3 seconds!):
----------------------------------
> export LC_ALL=no_NO.UTF8
> time find /usr/bin | grep -c zzz
0

real    0m0.035s
user    0m0.000s
sys     0m0.030s
> time find /usr/bin | grep -c bin
2054

real    0m3.364s
user    0m3.320s
sys     0m0.000s
----------------------------------
> export LC_ALL=C
> time find /usr/bin | grep -c zzz
0

real    0m0.021s
user    0m0.000s
sys     0m0.010s
> time find /usr/bin | grep -c bin
2054

real    0m0.021s
user    0m0.010s
sys     0m0.010s
----------------------------------

We tracked the problem down to EGexecute in search.c.
As I understand it, the function scans a buffer and returns _one_ match
from the buffer each time it's called. The problem is: for each call to
EGexecute, check_multibyte_string (also in search.c) is called once, on
the *entire buffer*. If all N lines match, and the buffer contains, say,
1000 lines, that means check_multibyte_string will have to process
N*1000 lines, not just N lines. Hence the low performance.
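
To make the arithmetic above concrete, here is a rough cost model in
Python. It is not grep's code: the function names and the sample buffer
are made up for illustration, and it assumes that every line matches and
that each call re-scans the whole buffer before returning one match.

```python
def bytes_scanned_per_call(lines):
    """Model of the slow path: one call per matching line, and each call
    re-scans the entire buffer before it returns its single match."""
    buffer_size = sum(len(line) for line in lines)
    calls = len(lines)  # assumption: every line matches
    return calls * buffer_size


def bytes_scanned_once(lines):
    """Model of the ideal case: each byte of the buffer is scanned once."""
    return sum(len(line) for line in lines)


# Hypothetical buffer: 1000 lines of 18 bytes each.
lines = [b"/usr/bin/tool%04d\n" % i for i in range(1000)]

slow = bytes_scanned_per_call(lines)
fast = bytes_scanned_once(lines)
print(slow, fast, slow // fast)
```

In this model the work grows with the square of the number of matching
lines in the buffer, which fits the observation that the zzz case (no
matches) is fast while the bin case (everything matches) is slow.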

Patch suggestion attached.
Warning: This is just something I whipped together rather haphazardly. No
guarantees here :)

How the patch works:
Instead of parsing the _entire buffer at once_ with
check_multibyte_string, parse it _incrementally_ in chunks of 100 bytes,
as far as needed.
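
The incremental idea can be sketched as follows. This is a Python
illustration only, not the attached diff: the class LazyCharMap, its
methods and the simplified UTF-8 length rule are assumptions made for
the example (continuation bytes are not validated). Only the chunk size
of 100 bytes is taken from the description above.

```python
CHUNK = 100  # chunk size mentioned in the mail


def utf8_char_len(lead):
    """Byte length of the UTF-8 sequence that starts with `lead`.
    Simplified: invalid or stray bytes count as one-byte units."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 1


class LazyCharMap:
    """Records where characters start, classifying the buffer lazily."""

    def __init__(self, buf):
        self.buf = buf
        # props[i] is the character length if a character starts at
        # byte i, and 0 if byte i is inside a multibyte character.
        self.props = [0] * len(buf)
        self.parsed = 0  # number of bytes classified so far

    def ensure(self, offset):
        """Classify CHUNK-sized pieces until `offset` is covered."""
        while self.parsed <= offset and self.parsed < len(self.buf):
            limit = min(self.parsed + CHUNK, len(self.buf))
            pos = self.parsed
            while pos < limit:
                n = min(utf8_char_len(self.buf[pos]), len(self.buf) - pos)
                self.props[pos] = n
                pos += n
            # A character may straddle the chunk boundary, so `pos`
            # can end up to 3 bytes past `limit`.
            self.parsed = pos

    def is_char_start(self, offset):
        self.ensure(offset)
        return self.props[offset] != 0


buf = "blåbær\n".encode("utf-8") * 200  # 1800 bytes of sample input
cmap = LazyCharMap(buf)
print(cmap.is_char_start(2), cmap.is_char_start(3), cmap.parsed)
```

A match near the start of the buffer then costs about one chunk of
classification work rather than a pass over the whole buffer, and later
queries reuse what has already been classified.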


Obviously, the other execute-functions must be patched too.


--Håkon A. Hjortland

grep-2.5_search.c_utf8speed.diff
Description: Text document
