bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: removing blank lines: "grep ." is really slow


From: Paolo Bonzini
Subject: Re: removing blank lines: "grep ." is really slow
Date: Mon, 19 Apr 2010 09:58:53 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100301 Fedora/3.0.3-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.3

On 04/18/2010 06:32 AM, Ivan wrote:
On Apr 16, 2010, at 3:37 AM, Paolo Bonzini wrote:

True. You'd need to expand UTF-8 period characters to the appropriate
character sets, then you can use the faster single-byte character set
matcher. It's on my todo list.

So... right now, "." means "valid UTF-8 character"? Or not?

Yes, if your locale is UTF-8.

I'm a little
confused about the difference between the current behavior and the
behavior after you accomplish your todo list.

Instead of using the superslow algorithm that it's currently using, it would expand each . to

   [0x00-0x7f] |
   (([0xc2-0xdf] |
     (([0xe0-0xef] |
       [0xf0-0xf7] [0x80-0xbf]) [0x80-0xbf])) [0x80-0xbf])

This way it can be matched fast with a DFA. BTW, I got round to writing the patch and will commit it soon.

Anyway, I sent my original email because I couldn't think of any
non-buggy reason for "grep ." to take an entire millisecond per line.

grep's current algorithm to do multibyte matching is very bad. I haven't really had time to understand it, so all of my current optimization work was focused on optimizing the regex instead. It actually was pretty successful, but it means the worst case remains insanely slow.

I'm also puzzled by this:

bash$ time yes | head -n 5000 | grep '[a-b]' >/dev/null
real 0m0.159s

bash$ time yes | head -n 5000 | grep '[y-z]' >/dev/null
real 0m3.755s

bash$ time yes | head -n 5000 | grep '[yz]' >/dev/null
real 0m0.168s

Are these behaviors expected?

Yes, except that I would have thought [y-z] to be faster than [a-b]. The problem is that [y-z] means everything that _collates_ between y and z. In a Spanish locale, [l-m] could match the _two_ characters "ll". This is more difficult to handle, and similar to the period case above. The lack of functions to portably analyze the weights of the collation elements makes things worse.

I'm sure it can be optimized, but it won't be as easy as the low-hanging fruit above.

Paolo




reply via email to

[Prev in Thread] Current Thread [Next in Thread]