[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Compiling Elisp to a native code with a GCC plugin

From: Thomas Lord
Subject: Re: Compiling Elisp to a native code with a GCC plugin
Date: Wed, 15 Sep 2010 09:28:57 -0700

On Wed, 2010-09-15 at 17:46 +0200, Helmut Eller wrote:
> * Stefan Monnier [2010-09-15 14:59] writes:

> > IIRC regexps without back-refs can be matched (and searched) in O(N)
> > where N is the length of the input.  

Not quite.   Let R be the length of the pattern
and L be the length of the string we seek to match.

Assume that the pattern is a true regular expression
(no back-references, no sub-expression position
reporting, etc.)

The problem itself is O(R * L):  no algorithm can
guarantee doing better than the product of the two

There is an algorithm (compiling the pattern to a DFA
first) which is O(2^R + L). Formally it is suboptimal.
There is a catch though.   For many common patterns,
either R is very small or we don't actually need
anything like 2^R steps to compile the pattern,
and so the *expected* case (for these patterns)
is O(L).

The Thompson "DFA caching algorithm" described briefly
in the Dragon compiler book is quite attractive
because it offers (for true regular expressions) a 
worst case complexity of O(R * L) ... so it is optimal ...
yet also delivers much closer to O(L) for many
common patterns.   The algorithm is a bit 
unattractive because it is tricky to implement 
well, very hard to generalize beyond true regular
expressions, and can easily lose to more naive NFA
algorithms by small but annoying constant factors.

> With back-refs, I think (not sure)
> > the theoretical bound is O(N^2), which requires
> > a non-backtracking algorithm.

> > So yes, we'd need to handle back-refs specially.  Several regexp engines
> > do that already (they have a few different inner engines and choose
> > which one to use based on the particular regexp at hand).
> After googleing a bit I found this page 
> http://swtch.com/~rsc/regexp/regexp1.html
> which again links to this
> http://perl.plover.com/NPC/NPC-3SAT.html
> which says that regexp matching with backreferences is NP-complete.

The best known algorithms are not O(N^2) but,
rather, O(2^N) in the worst case.   It's dismal
how such an innocent seeming feature can cause
such havoc.

> Cox (the first page) seems to say that backtracking-with-memoization is
> linear time at the expense of O(N) space.

If he said that regarding worst case performance
and without mentioning some specific subset of 
true regular expressions that he's talking about,
he must have made a mistake.

For true regular expressions, the simpler case,
you can not beat O(R * L) time as the worst case.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]