Re: removing blank lines: "grep ." is really slow

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: removing blank lines: "grep ." is really slow

From:	Paolo Bonzini
Subject:	Re: removing blank lines: "grep ." is really slow
Date:	Mon, 19 Apr 2010 09:58:53 +0200
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100301 Fedora/3.0.3-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.3

On 04/18/2010 06:32 AM, Ivan wrote:

On Apr 16, 2010, at 3:37 AM, Paolo Bonzini wrote:

True. You'd need to expand UTF-8 period characters to the appropriate
character sets, then you can use the faster single-byte character set
matcher. It's on my todo list.


So... right now, "." means "valid UTF-8 character"? Or not?


Yes, if your locale is UTF-8.

I'm a little
confused about the difference between the current behavior and the
behavior after you accomplish your todo list.

Instead of using the superslow algorithm that it's currently using, itwould expand each . to


   [0x00-0x7f] |
   (([0xc2-0xdf] |
     (([0xe0-0xef] |
       [0xf0-0xf7] [0x80-0xbf]) [0x80-0xbf])) [0x80-0xbf])

This way it can be matched fast with a DFA. BTW, I got round to writingthe patch and will commit it soon.

Anyway, I sent my original email because I couldn't think of any
non-buggy reason for "grep ." to take an entire millisecond per line.

grep's current algorithm to do multibyte matching is very bad. Ihaven't really had time to understand it, so all of my currentoptimization work was focused on optimizing the regex instead. Itactually was pretty successful, but it means the worst case remainsinsanely slow.

I'm also puzzled by this:

bash$ time yes | head -n 5000 | grep '[a-b]' >/dev/null
real 0m0.159s

bash$ time yes | head -n 5000 | grep '[y-z]' >/dev/null
real 0m3.755s

bash$ time yes | head -n 5000 | grep '[yz]' >/dev/null
real 0m0.168s

Are these behaviors expected?

Yes, except that I would have thought [y-z] to be faster than [a-b].The problem is that [y-z] means everything that _collates_ between y andz. In a Spanish locale, [l-m] could match the _two_ characters "ll".This is more difficult to handle, and similar to the period case above.The lack of functions to portably analyze the weights of the collationelements makes things worse.

I'm sure it can be optimized, but it won't be as easy as the low-hangingfruit above.


Paolo

[Prev in Thread]

Current Thread

[Next in Thread]

removing blank lines: "grep ." is really slow, Ivan, 2010/04/15
- Re: removing blank lines: "grep ." is really slow, Paolo Bonzini, 2010/04/16
  - Re: removing blank lines: "grep ." is really slow, Ivan, 2010/04/18
    - Re: removing blank lines: "grep ." is really slow, Paolo Bonzini <=
    - Re: removing blank lines: "grep ." is really slow, Paul Eggert, 2010/04/23
    - Re: removing blank lines: "grep ." is really slow, Paolo Bonzini, 2010/04/24

Prev by Date: Re: [PATCH 3/3] dfa: optimize UTF-8 period
Next by Date: Re: [PATCH 3/3] dfa: optimize UTF-8 period
Previous by thread: Re: removing blank lines: "grep ." is really slow
Next by thread: Re: removing blank lines: "grep ." is really slow
Index(es):
- Date
- Thread