emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Regular expression libraries


From: Stefan Monnier
Subject: Re: Regular expression libraries
Date: Fri, 16 Dec 2016 00:15:18 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1.50 (gnu/linux)

> * Emacs has special regexp features based on syntax classes, or the position 
> of the point.  These features don't seem to be supported in gnulib nor glibc
> * Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for the 
> subject string

The gnulib regexp code used to be a derivative of the code we have in
Emacs (or rather they both derived from the same codebase).
Since then it's been replaced with a completely different codebase.

But IIRC the two did merge back (or close enough) before gnulib' code
was replaced with something else.

> Stefan also expressed a desire to move to a better regular expression
> engine — ideally one that uses a non-backtracking implementation when
> possible, to avoid the exponential blowups that Emacs currently
> suffers from on certain regular expressions.

>From my point of view, the main problem with the current code is the
blowup in pathological cases, and it's the only one that justifies moving
to another implementation.

Some users may want to change the engine so as to get new features, but
FWIW, I think it'd be much easier to just add the features to the
current engine.  The main reason such features are currently lacking is
that I've been resisting their addition because I don't want such
features to later make it harder to switch to a non-backtracking engine.

> * There are only two usable, non-backtracking regexp libraries:

What about the one in glibc?

>   - TRE (http://laurikari.net/tre/) uses backtracking if needed, but
>     otherwise uses a finite automaton.  It's stable, can be relatively
>     easily linked into Emacs, and performance is similar to what we
>     have (I benchmarked it).  It does support matching on a gap buffer
>     (it has an interface to use the moral equivalent of a random
>     iterator, reguexec).  It doesn't support Emacs' native encoding
>     out of the box, though, and the library is unmaintained; there are
>     multiple forks which fix various security vulnerabilities.

So, this would be acceptable, but would mean we'd still use "our own
code" and maintain it ourselves.

> If I understand our current implementation, our regexp.c functions
> take the moral equivalent of two strings, and match over that,
> pretending that it's just one large string.  In practice, these two
> strings are the two halves of the gap buffer.  Correct?

That's right.


        Stefan




reply via email to

[Prev in Thread] Current Thread [Next in Thread]