[bug #29391] -i and utf8 slowness, speedup idea

From: Egmont Koblinger
Subject: [bug #29391] -i and utf8 slowness, speedup idea
Date: Wed, 31 Mar 2010 09:13:47 +0000
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100217 Firefox/3.5.8


                 Summary: -i and utf8 slowness, speedup idea
                 Project: grep
            Submitted by: egmont
            Submitted on: Wed 31 Mar 2010 09:13:46 AM GMT
                Category: None
                Severity: 3 - Normal
              Item Group: None
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any



[ followup of https://bugzilla.redhat.com/show_bug.cgi?id=194471 ]

The combination of --ignore-case and UTF-8 is very slow, even when no special
treatment is required for UTF-8.  There's also a huge speed regression
compared to Ubuntu Hardy's grep-2.5.3, which carries patches I have not
identified. Some timing data:

$ # Create a 100MB file.
$ dd if=/dev/urandom of=data bs=1M count=100

$ # Ubuntu's grep-2.5.3, UTF-8: fast
$ time LC_ALL=en_US.UTF-8 /bin/fgrep -i foobar data
real    0m0.245s
user    0m0.128s
sys     0m0.112s

$ # Stock grep-2.6.2, 8-bit: fast
$ time LC_ALL=en_US fgrep -i foobar data
real    0m0.156s
user    0m0.084s
sys     0m0.072s

$ # Stock grep-2.6.2, UTF-8: very slow
$ time LC_ALL=en_US.UTF-8 fgrep -i foobar data
real    0m10.264s
user    0m10.049s
sys     0m0.080s

This is approximately a 40-60x slowdown in wall-clock time (10.264s versus
0.245s and 0.156s).

I understand that the combination of UTF-8 and ignore-case is a tricky
situation, and if I'm using the tr_TR.UTF-8 locale then of course I'm willing
to pay this price for the correct handling of dotless i's.

Most of the time, however, I'm working with en_US.UTF-8 and grepping variable
names in source code and such, usually without any accents.

Grep could do the following:

It could look at the pattern, and check whether all of the following
conditions are true:

- no placeholder that could match a variable-length character (e.g. no "." in
the pattern) or other weird stuff

- only ASCII characters

- only characters whose old-fashioned ASCII upper/lowercase counterparts are
the same as the locale-aware upper/lowercase counterparts, that is, no "i" or
"I" in the pattern if the locale is Turkish.
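
The three conditions above can be sketched as a single predicate. This is only
an illustration in Python, not grep's code: the function names, the set of
regex metacharacters treated as "weird stuff", and the list of languages with
a dotless i (I have added "az" next to "tr") are all my assumptions.

```python
# Hypothetical sketch of the proposed check; not taken from grep's sources.

# Assumption: conservatively treat every regex metacharacter as "weird stuff".
REGEX_META = set(".[]()*+?{}|^$\\")

# Assumption: languages whose case mapping of "i"/"I" differs from ASCII.
DOTLESS_I_LANGS = {"tr", "az"}


def language_of(locale_name):
    """Extract the language part of a locale name, e.g. 'tr_TR.UTF-8' -> 'tr'."""
    return locale_name.split(".")[0].split("_")[0].split("@")[0].lower()


def can_use_bytewise_ignore_case(pattern, locale_name, fixed_string=True):
    """Return True if the 8-bit matcher would find the same matches."""
    # Condition: only ASCII characters in the pattern.
    if not pattern.isascii():
        return False
    # Condition: no placeholder that could match a variable-length character.
    # With fgrep-style fixed strings, "." and friends are literal and harmless.
    if not fixed_string and any(ch in REGEX_META for ch in pattern):
        return False
    # Condition: ASCII case mapping must agree with the locale's case mapping.
    if language_of(locale_name) in DOTLESS_I_LANGS and any(ch in "iI" for ch in pattern):
        return False
    return True
```

With this sketch, "foobar" would qualify for the fast path in both en_US.UTF-8
and tr_TR.UTF-8, while "file" would qualify only in en_US.UTF-8.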

If all these are true, it could use whatever algorithm it's using for 8-bit
locales, because it will find the same matches. This would provide a 40-60x
speedup for a very common use case: case insensitively finding an English

