bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#16893: [PATCH] Avoid matching line-by-line for case-insensitive with


From: Jim Meyering
Subject: bug#16893: [PATCH] Avoid matching line-by-line for case-insensitive with grep
Date: Thu, 27 Feb 2014 18:54:14 -0800

On Thu, Feb 27, 2014 at 8:54 AM, Glenn Morris <address@hidden> wrote:
>
> This message was sent to the address@hidden address with no
> "Package:" specified in the body. So it ended up on the help-debbugs
> mailing list rather than bug-grep. I have assigned it to grep and am
> sending this mail, which will appear on the bug-grep list.
>
> For new reports, either use the bug-grep address, or remember to use
> Package: grep at the start of the body. They both have identical results.
>
> Norihiro Tanaka wrote:
>
>> Now grep and awk matchers doesn't waste buffer in case-sensisitive matching.
>> So I think that we can avoid line-by-line matching for them.
>>
>> It enable to speed up case-sensitive matching with grep or awk matcher
>> without trivial_case_ignore as fast as when with it.
>>
>> In bug#16232:
>>> The following times 2.16, 2.17 and 2.17+patch two ways:
>>>
>>> $ yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
>>> $ for i in 16 17 18; do echo $i; env LC_ALL=en_US.UTF-8 time
>>> /p/p/grep-2.$i/bin/grep -i foobar k; done
>>> 16
>>>        15.96 real        14.57 user         0.12 sys
>>> 17
>>>         1.13 real         1.07 user         0.06 sys
>>> 18
>>>         1.96 real         1.89 user         0.06 sys
>>>
>>> The above search takes more than 70% longer with the proposed patch.
>>
>> Therefore, I think 30% slow-down is caused by the line-by-line matching
>> for them.
>
> [See attachment at http://debbugs.gnu.org/16893]

Thank you for forwarding that, Glenn.

Thank you for the patch, Norihiro.
However, your removal of the "MB_CUR_MAX == 1" disjunct
would cause unibyte grep -i with (-F or -P) to match line-by-line,
whereas currently it uses the buffer-matching code.
That would make a search like this take 3 times longer:

  seq 30000000 > in
  LC_ALL=C grep -iF foo in

Without your patch, best of 5 wall clock time is 0.31s,
yet with the patch, it takes 1.00s.

I presume you will want to retain that disjunct?
If you submit an adjusted patch, please also include in the
commit log some timing examples showing how the change affects
grep's performance.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]