bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep



From:	Paul Eggert
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Tue, 23 Nov 2021 19:36:11 -0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.3.0

On 9/30/14 12:39, Paul Eggert wrote:

GNU grep is smart enough to start matching at character boundaries without checking the validity of the input data. This helps it run faster. However, because libpcre requires a validity prepass, grep -P must slow down and do the validity check one way or another. Grep does this only when libpcre is used, and that's one reason grep -P is slower than plain grep.
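To make the distinction concrete, here is a minimal sketch in C. It is not grep's actual code, and the names at_char_boundary and utf8_valid are hypothetical. The first function only inspects one byte to decide whether a character can start there. The second is the kind of full validity prepass that has to visit every byte of the buffer before matching can begin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cheap test (hypothetical helper): a byte of the form 10xxxxxx is a
   UTF-8 continuation byte, so any other byte can begin a character.
   Only the byte at index I is examined.  */
static bool
at_char_boundary (unsigned char const *buf, size_t i)
{
  assert (buf != NULL);
  return (buf[i] & 0xC0) != 0x80;
}

/* Full validity prepass (hypothetical helper): return true only if
   all LEN bytes of BUF form well-formed UTF-8.  Every byte is visited.  */
static bool
utf8_valid (unsigned char const *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = buf[i];
      size_t n;           /* continuation bytes expected after C */
      unsigned long min;  /* smallest code point allowed at this length */
      unsigned long cp;   /* code point being decoded */

      if (c < 0x80)
        {
          i++;            /* ASCII byte */
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        { n = 1; min = 0x80; cp = c & 0x1F; }
      else if ((c & 0xF0) == 0xE0)
        { n = 2; min = 0x800; cp = c & 0x0F; }
      else if ((c & 0xF8) == 0xF0)
        { n = 3; min = 0x10000; cp = c & 0x07; }
      else
        return false;     /* stray continuation byte or invalid lead byte */

      if (len - i <= n)
        return false;     /* sequence truncated at end of buffer */

      for (size_t k = 1; k <= n; k++)
        {
          if ((buf[i + k] & 0xC0) != 0x80)
            return false; /* expected a continuation byte */
          cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

      /* Reject overlong forms, surrogates, and out-of-range values.  */
      if (cp < min || cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF))
        return false;

      i += n + 1;
    }
  return true;
}
```

The boundary test costs one mask and compare at the position of interest, whereas the prepass is linear in the size of the input, which is where the extra time goes on a large binary file.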

Now that Grep master on Savannah has been changed to use PCRE2 instead of PCRE, the 'grep -P' performance problem seems to have been fixed, in that the following commands now take about the same amount of time:


grep -P zzzyyyxxx 10840.pdf
pcre2grep -U zzzyyyxxx 10840.pdf

where the file is from <http://research.nhm.org/pdfs/10840/10840.pdf>. Formerly, 'grep -P' was about 10x slower on this test.

My guess is that the grep -P performance boost comes from bleeding-edge grep using PCRE2's PCRE2_MATCH_INVALID_UTF option.

I'm closing this old bug report <https://bugs.gnu.org/18454>. We can always reopen it if there are still performance issues that I've missed.
