--- Begin Message ---
Subject: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 12 Sep 2014 03:24:49 +0200
User-agent: Mutt/1.5.23-6361-vl-r59709 (2014-07-25)
With the patch that fixes bug 18266, grep -P works again on binary
files (those containing invalid UTF-8 sequences), but it is now
significantly slower than the old versions (whose behavior on such
input could be undefined).
Timings with the Debian packages on my personal svn working copy
(binary + text files):
2.18-2 0.9s with -P, 0.4s without -P
2.20-3 11.6s with -P, 0.4s without -P
On this example, that's a 13x slowdown! Though the performance issue
would be better fixed in libpcre3, I suppose that such a fix is not so
simple and won't happen any time soon. In the meantime, several things
could be done in grep:
1. Ignore -P when the pattern would have the same meaning without -P
(patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
at least for the simplest cases).
2. Call PCRE in the C locale when this is equivalent.
3. Transform invalid bytes to null bytes in place before the PCRE
call. This changes the current semantics, but:
* the semantics of matching invalid bytes have never been specified,
AFAIK;
* the best *practical* behavior may not be the current one
(I personally prefer to be able to match invalid bytes, just
as one can match top-bit-set characters in the C locale, and
treating such invalid bytes as equivalent to null bytes would
not be a problem for most users, IMHO -- this could also be made
configurable).
--
Vincent Lefèvre <vincent@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
--- End Message ---
--- Begin Message ---
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Tue, 23 Nov 2021 19:36:11 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.3.0
On 9/30/14 12:39, Paul Eggert wrote:
> GNU grep is smart enough to start matching at character boundaries
> without checking the validity of the input data. This helps it run
> faster. However, because libpcre requires a validity prepass, grep -P
> must slow down and do the validity check one way or another. Grep does
> this only when libpcre is used, and that's one reason grep -P is
> slower than plain grep.
Now that Grep master on Savannah has been changed to use PCRE2 instead
of PCRE, the 'grep -P' performance problem seems to have been fixed, in
that the following commands now take about the same amount of time:
grep -P zzzyyyxxx 10840.pdf
pcre2grep -U zzzyyyxxx 10840.pdf
where the file is from <http://research.nhm.org/pdfs/10840/10840.pdf>.
Formerly, 'grep -P' was about 10x slower on this test.
My guess is that the grep -P performance boost comes from bleeding-edge
grep using PCRE2's PCRE2_MATCH_INVALID_UTF option.
I'm closing this old bug report <https://bugs.gnu.org/18454>. We can
always reopen it if there are still performance issues that I've missed.
--- End Message ---