[debbugs-tracker] bug#26193: closed ([0-9] versus [[:digit:]])

From: GNU bug Tracking System

Subject: [debbugs-tracker] bug#26193: closed ([0-9] versus [[:digit:]])

Date: Thu, 23 Mar 2017 01:58:02 +0000

Your message dated Wed, 22 Mar 2017 18:57:05 -0700 with message-id <address@hidden> and subject line Re: bug#26193: [0-9] versus [[:digit:]] has caused the debbugs.gnu.org bug report #26193, regarding [0-9] versus [[:digit:]] to be marked as done.

(If you believe you have received this mail in error, please contact address@hidden)

--
26193: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=26193
GNU Bug Tracking System
Contact address@hidden with problems

--- Begin Message ---
Subject: [0-9] versus [[:digit:]]
Date: Mon, 20 Mar 2017 11:34:05 -0400

In what follows, the file "conjectures" is a 6-billion-byte file in which each line contains at most one letter P, and few lines (see output) have a digit following the P. "rusage" is just a home-brew resource-usage summary command.

rusage egrep 'P[0-9]' conjectures > xxx

695.55 real 688.33 user 2.40 sys 0 pf 186 pr 0 sw 0 rb 8 wb 1 vcx 19206 icx 2488 mx 0 ix 0 id 0 is

cat xxx

A[21]=11{11}:22<LP3

rusage egrep 'P[[:digit:]]' conjectures > xxx

14.88 real 13.36 user 1.43 sys 0 pf 186 pr 0 sw 0 rb 8 wb 0 vcx 516 icx 2500 mx 0 ix 0 id 0 is

cat xxx

A[21]=11{11}:22<LP3

Using what is to me the more obvious [0-9] pattern takes almost 50 times as long as using the [[:digit:]] pattern. Seems very strange.

grep --version

grep (GNU grep) 2.25

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others, see <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.

uname -a

Linux jpl 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

--- End Message ---
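
For anyone who wants to check the comparison above on their own system, here is a minimal sketch. It assumes a small synthetic stand-in for the "conjectures" file (the file name and its contents are made up for illustration) and uses `grep -E`, which is equivalent to the `egrep` in the report. It only confirms that the two patterns select the same lines in the current locale; it does not reproduce the timings.

```shell
#!/bin/sh
# Hypothetical stand-in for the 6-billion-byte "conjectures" file.
tmp=$(mktemp -d)
sample="$tmp/conjectures.sample"

# 1000 lines where P is not followed by a digit...
i=0
while [ "$i" -lt 1000 ]; do
    echo "A[$i]=11{11}:22<LPx"
    i=$((i + 1))
done > "$sample"
# ...plus the one matching line shown in the report.
echo 'A[21]=11{11}:22<LP3' >> "$sample"

# Run both patterns in whatever locale the shell is currently using.
out_range=$(grep -E 'P[0-9]' "$sample")
out_class=$(grep -E 'P[[:digit:]]' "$sample")

# On ASCII input like this, both patterns should pick the same lines;
# the report is about a difference in speed, not in results.
if [ "$out_range" = "$out_class" ]; then
    echo "same output: $out_range"
else
    echo "outputs differ"
fi

rm -rf "$tmp"
```

To look at run time as well, each `grep` line can be run on a much larger input under whatever timing command is available, such as the reporter's own "rusage" wrapper. The numbers will depend on the grep version and locale.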

--- Begin Message ---
Subject: Re: bug#26193: [0-9] versus [[:digit:]]
Date: Wed, 22 Mar 2017 18:57:05 -0700

On Wed, Mar 22, 2017 at 2:58 PM, John P. Linderman <address@hidden> wrote:
> I used to use LC_ALL=C, but, as I vaguely recall, it got in the way of
> dealing with UNICODE. I tried a couple LC values aimed at UNICODE and the
> US, but something always went pear-shaped. I finally gave up. I am perfectly
> happy to suffer a tiny bit of performance, to have most things work without
> thinking. A factor of 6, or 35, is not tiny, since I use grep and friends
> intensely. That's how I discovered the performance problem to begin with.
> Anyway, thank you for fixing my problem. I suspect that many of us pioneers
> (using UNIX since 1973) have '[0-9]' wired into our fingers.
>
> On Wed, Mar 22, 2017 at 2:01 PM, Paul Eggert <address@hidden> wrote:
>>
>> On 03/22/2017 05:44 AM, John P. Linderman wrote:
>>>
>>> That puts the runtimes on equal footing:
>>>
>> In my measurements, P[0-9] is still a tiny bit slower if one is using
>> glibc regex, due to a performance problem in glibc. You can work around it
>> by configuring --with-included-regex. It's probably not worth worrying
>> about, though.
>>
>> By the way, using LC_ALL=C should help avoid performance problems like
>> these in the future, if all you're doing is something where single-byte
>> pattern matching suffices.

I've just pulled that gnulib change into grep's repository with the
attached, along with a NEWS update:

grep-gnulib-dfa-NEWS.diff
Description: Text document

--- End Message ---
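
The LC_ALL=C suggestion quoted in the message above can be sketched as follows. This is a hedged illustration, not part of the fix: the sample file and its contents are assumptions, and the approach only suits cases where single-byte matching is enough, as the quoted advice notes.

```shell
#!/bin/sh
# Hypothetical sample input; substitute the real file.
sample=$(mktemp)
printf '%s\n' 'A[20]=11{11}:22<LPx' 'A[21]=11{11}:22<LP3' > "$sample"

# Prefixing one command with LC_ALL=C sets the C locale for that command
# only, so grep matches byte by byte while the surrounding shell keeps
# its usual (for example UTF-8) locale.
matches=$(LC_ALL=C grep -E 'P[0-9]' "$sample")
echo "$matches"

rm -f "$sample"
```

Scoping the variable to a single command avoids the side effects on Unicode handling that exporting LC_ALL=C for the whole session can cause.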
