[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Patch for lookaround assertion in regexp

From: Daniel Colascione
Subject: Re: Patch for lookaround assertion in regexp
Date: Tue, 14 Feb 2012 09:53:07 -0800
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:10.0.1) Gecko/20120208 Thunderbird/10.0.1

On 1/23/12 6:11 AM, Stefan Monnier wrote:
>> In 2009, Tomohiro Matsuyama sent a message to this list with a patch
>> to add lookahead/lookbehind assertions to Emacs regular expressions
>> (regexps).  Is there any plan to incorporate this useful feature into
>> an official release?
>> [IMHO, it is incredibly useful. The only responses to Matsuyama-san's
>> message were positive.]
> I'd like to replace the current regexp engine with one that does not
> suffer from exponential blowup (i.e. using "Thompson's NFA").
> Every feature added to the current regexp engine will make it more
> difficult to switch, so I'm not too thrilled at the idea of adding
> lookaround operators.
> OTOH, noone has submitted code to replace the current regexp engine, and
> I don't forsee I'll have the time to write it myself, so maybe I should
> just give up on this plan.

Implementing a fully general NFA-based regular expression matching
engine that support submatches is hard. The only two useful
implementations of which I'm aware are RE2 and Ville Laurikari's TRE,
both of which are two-clause BSD licensed. Laurikari wrote his thesis
[2] on the latter. TRE is the better of the two libraries, IMHO,
because it's single-pass and can work over arbitrary kinds of input
stream (like characters in a gap buffer). TRE's approximate matching
support is occasionally useful as well.

That said, I'd actually prefer to head in the other direction and
allow users to express arbitrarily rich grammars using an extended rx
syntax. The idea is basically to support parser combinator grammars,
which can be efficiently matched using a scannerless GRL parser, which
is O(N^3) time, or a memozing and backtracking "packrat" parser, which
is O(N) time and O(n) space. The end result would look a bit like Perl
6's rules.

I have some ultra-preliminary work at

Basically, instead of matching regular expressions more efficiently,
we should add support for efficient parsing using
declaratively-constructed grammars, of which regular expressions are
simply a special case. Such support would be incredibly useful for
processing languages like Perl and C++.

[1] http://laurikari.net/tre/about/
[2] http://laurikari.net/ville/regex-submatch.pdf

Attachment: signature.asc
Description: OpenPGP digital signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]