bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Vincent Lefevre
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 28 Nov 2014 03:59:18 +0100
User-agent: Mutt/1.5.23-6365-vl-r59709 (2014-09-07)
On binary files, testing the UTF-8 sequences in pcresearch.c seems
to be faster than asking pcre_exec to do it (because of the retry,
I assume); see the attached patch. The patch actually checks UTF-8
only if pcre_exec has already found an invalid sequence, on the
assumption that pcre_exec can check the validity of a valid text
file faster.
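
To illustrate the kind of check involved, here is a minimal sketch of a strict UTF-8 validator in C. It is not the attached patch: the function name utf8_valid_prefix and its interface are hypothetical, and the byte ranges are the ones given in RFC 3629. It returns the length of the longest valid prefix, so a caller could search that part and handle the offending byte separately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not taken from the patch): return the length of the
   longest prefix of buf[0..len) that is valid UTF-8 according to RFC 3629.
   The result equals len when the whole buffer is valid.  */
size_t utf8_valid_prefix(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;                        /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF; /* range of 1st continuation byte */

        if (c < 0x80) {                     /* ASCII */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            need = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2;
            if (c == 0xE0)
                lo = 0xA0;                  /* reject overlong encodings */
            else if (c == 0xED)
                hi = 0x9F;                  /* reject UTF-16 surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3;
            if (c == 0xF0)
                lo = 0x90;                  /* reject overlong encodings */
            else if (c == 0xF4)
                hi = 0x8F;                  /* reject values above U+10FFFF */
        } else {
            return i;                       /* 0x80..0xC1 or 0xF5..0xFF */
        }

        if (len - i <= need)
            return i;                       /* sequence is truncated */
        if (buf[i + 1] < lo || buf[i + 1] > hi)
            return i;
        for (size_t k = 2; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return i;

        i += need + 1;
    }
    return len;
}
```

A single forward pass like this needs no restart after an invalid byte, which is the property that matters for binary files with many invalid sequences.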
On a file similar to a PDF (test 1):
Before: 1.77s
After: 1.38s
But now, the main problem is the large number of pcre_exec calls.
Indeed, if I replace the non-ASCII bytes by \n with:
LC_ALL=C tr \\200-\\377 \\n
(which gives a valid file, but with many short lines), the grep -P time
is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes
with:
LC_ALL=C tr \\200-\\377 \\000
the grep -P time is 0.30s (test 3), thus it is much faster.
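
For reference, the two tr commands above map every byte in the octal range 200-377 (0x80-0xFF) to a single replacement byte. A minimal C sketch of the same transformation, with the hypothetical name replace_non_ascii, could look like this:

```c
#include <stddef.h>

/* Hypothetical helper: overwrite every byte in 0x80..0xFF (octal 200..377)
   with repl, as "LC_ALL=C tr \\200-\\377 ..." does on a stream.
   Returns the number of bytes replaced.  */
size_t replace_non_ascii(unsigned char *buf, size_t len, unsigned char repl)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x80) {
            buf[i] = repl;
            count++;
        }
    }
    return count;
}
```

With repl set to '\n' this corresponds to test 2 (many short lines), and with repl set to 0 it corresponds to test 3.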
Note also that libpcre is much slower than normal grep on simple words,
but on "a[0-9]b", it can be faster:
         grep   PCRE   PCRE+patch
test 1   4.31   1.90   1.53
test 2   0.18   1.61   1.63
test 3   3.28   0.39   0.39
I wonder why test 2 is so much faster with plain grep.
--
Vincent Lefèvre <address@hidden> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
grep221-pcresearch.patch
Description: Text document