bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Regular expressions with high or nested repetition counters


From: Dag Hovland
Subject: Regular expressions with high or nested repetition counters
Date: Mon, 03 May 2010 09:51:36 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.0.4-1.fc12 Thunderbird/3.0.4

Hi!

I am a PhD student doing some research in regular expressions. I have
implemented an algorithm for matching a subclass of the regular
expressions with "numerical constraints", that is, expressions like
"a{10,100}". My starting point was the problems that can occur when
using large numbers, or nested constraints, as for example in the command:

grep -E "([0-9]{1,2}h([1-5]?[0-9]m([1-5]?[0-9]s){1,60}){1,60}){0,100}"

which consumes over 2 gb of memory on my computer.


from the grep man-page:
"Large repetition counts in the {n,m} construct may cause grep to use
lots of  memory."

An article describing the algorithm is available at
http://hdl.handle.net/1956/3628

A C-implementation is available at
http://www.ii.uib.no/~dagh/facgrep

The implementation is probably buggy, as I am not very good in C, but it
works for simple examples. On the example above, procps reports that it
uses around 2 megabytes of memory.

I wondered if you are interested in having this kind of functionality in
GNU grep? Note that for the majority of normal/simple regexps, grep does better than my program above. Also, it only handles a (well-defined) subclass of regexps. So we would need to find a nice way to choose which algorithm to use when.

Best regards,

Dag Hovland


Attachment: smime.p7s
Description: S/MIME Cryptographic Signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]