bug-bash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Fwd: Re: repeated extended pattern substitution incredibly slow w/large


From: Chet Ramey
Subject: Fwd: Re: repeated extended pattern substitution incredibly slow w/large variables
Date: Thu, 24 Nov 2016 16:48:30 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.5.0

--- Begin Message --- Subject: Re: repeated extended pattern substitution incredibly slow w/large variables Date: Thu, 24 Nov 2016 16:43:24 -0500 User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.5.0
On 9/17/16 2:32 PM, address@hidden wrote:

> Bash Version: 4.4
> Patch Level: 0
> Release Status: rc2 / release
> 
> Description:
>       The tests below were performed with 4.4.0-rc2. However, the problem is
>       still present in 4.4.0-release, only execution times are even higher
>       for about 20%.
> 
>       Repeated pattern substitution (here: removal) using an extended pattern
>       and variables of considerable size is incredibly time and cpu consuming.

It's true.  This is possibly the least efficient route to solve this
problem you could have chosen.

>       The command that revealed the problem was:
> 
>                D=${C//\[+([0-9])\]=}

It's not clear to me why you would have chosen to flatten an array into a
string, probably 30,000 characters long, before attempting to perform a
series of substitutions.  Consider what you have to do for each match: to
satisfy the leftmost-longest pattern matching semantics, once you match the
first character, you need to test the match from the end of the string,
one character at a time, to make sure you have the longest one each time.
Now, for the first match, you're going to end up calling the matcher
nearly 30,000 times to match the first four characters of the string.
This will be the case for every other match until you hit the end of the
string: millions of calls to the matcher.  It's hard to think of a less
efficient solution.

> 
>       The variable C contains the output of 'declare -p A', where A is an
>       array with 510 file names and C contains 510 matches. But as can be
>       seen below, also commands like
> 
>               D=${C//u+([a-z])}   or  D=${C//@(usr)}
> 
>       trigger the problem, but _not_ commands like
> 
>               D=${C//usr}         or  D=${C//u[a-z][a-z]}

Because the first two cases are unbounded: you can't tell the length of the
match by statically analyzing the pattern (well, in the second one you
might be able to, but the shell's matcher doesn't attempt it).  If you can
determine the maximum length of the match, as you can in the second two
examples, you can use that to reduce the number of calls to the matcher.

> 
>       See the test case and statistics below.
> 
>       Of course, the problem is simply solvable be a mini sed(1) script, but
>       every now and then I try comands like the above, because I think that
>       simple tasks should be doable without the aid of external programmes.
>       But in many such cases I must sadly accept that using external programs,
>       especially sed(1), is the quicker method.
>       Additionally I will have to revise my script (a ~100kb font editor)
>       and possibly replace other constructs using extended pattern maching.

The answer is simpler than you think.  Instead of flattening the array to a
string before you run the matcher, take advantage of the word expansions
the shell offers.

There's no reason to have the assignment to C add the `declare -a B=' just
so you can remove it.  You already have the individual strings in an array,
separated just how you want them.  You can use an assignment like:

        C="address@hidden//\[+([0-9])\]=/}"

to run the substitution on each individual array element in turn, then
"flatten" those results into a string.  It runs in a fraction of a second.
On my machine, with a 298-element array consisting of around 17,000
characters, the above construct takes

real    0m0.003s
user    0m0.002s
sys     0m0.000s

While it's true that the leftmost-longest matching semantics take quite a
long time to satisfy -- it's not a regular expression like sed is using;
it's a simple pattern matcher -- there is no reason to do things this way.

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    address@hidden    http://cnswww.cns.cwru.edu/~chet/

--- End Message ---

reply via email to

[Prev in Thread] Current Thread [Next in Thread]