Regular expression libraries

emacs-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Regular expression libraries

From:	Clément Pit--Claudel
Subject:	Regular expression libraries
Date:	Thu, 15 Dec 2016 14:00:25 -0500
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1

Hi emacs-devel,

The top of regex.c says this:

    /* Ignore some GCC warnings for now.  This section should go away
       once the Emacs and Gnulib regex code is merged.  */

And indeed I've seen brief discussions of this merge in the past 
(https://lists.gnu.org/archive/html/emacs-devel/2002-04/msg00083.html , 
https://lists.gnu.org/archive/html/emacs-devel/2016-07/msg01115.html).  Looking 
at this is more detail, though, it is not clear how such a merge could happen:

* Emacs has special regexp features based on syntax classes, or the position of 
the point.  These features don't seem to be supported in gnulib nor glibc
* Emacs uses a gap buffer, whereas gnulib and glibc expects a char* for the 
subject string

Stefan also expressed a desire to move to a better regular expression engine — 
ideally one that uses a non-backtracking implementation when possible, to avoid 
the exponential blowups that Emacs currently suffers from on certain regular 
expressions.

I've had a quick look at the existing options.  Here's what I gathered (the 
list below includes only libraries whose licensing is compatible with Emacs):

* There are only two usable, non-backtracking regexp libraries:
  - RE2 (from Google, C++) doesn't support any form of backtracking, though, so 
"\\(.*\\)\1" won't work anymore.  Additionally, RE2 does not support matching 
over a gap buffer (the subject string needs to be a contiguous chunk of 
memory).  Discussion here: https://github.com/google/re2/issues/126
  - TRE (http://laurikari.net/tre/) uses backtracking if needed, but otherwise 
uses a finite automaton.  It's stable, can be relatively easily linked into 
Emacs, and performance is similar to what we have (I benchmarked it).  It does 
support matching on a gap buffer (it has an interface to use the moral 
equivalent of a random iterator, reguexec).  It doesn't support Emacs' native 
encoding out of the box, though, and the library is unmaintained; there are 
multiple forks which fix various security vulnerabilities.
  In both cases, the syntax isn't compatible with that of Emacs, which would 
force us to write a compatibility layer.  And Emacs' internal encoding isn't 
supported either.

* Backtracking implementations are more common, but none seem to fit the bill:
  - Oniguruma and Onigmo (C, used in Ruby, the Atom text editor, and others) 
support a wide range of encodings, including allegedly Emacs' internal 
encoding, as well as many syntaxes, including Emacs' syntax.  But they don't 
support matching on a gap buffer either (they take a contiguous string).  
Discussion here: https://github.com/k-takata/Onigmo/issues/83
  - PCRE (C, used in Perl and others) doesn't have Emacs encoding, nor Emacs 
syntax.  It requires a contiguous buffer, but it does support partial matches.  
Partial matches can apparently be restarted, which could almost work in our 
case, but the semantics of restarting a search from a partial match are broken.
  - Boost (C++) can use a custom random-access iterator for the subject string, 
which should work for gap buffers; but I'm not sure how good the encoding story 
is.

If I understand our current implementation, our regexp.c functions take the 
moral equivalent of two strings, and match over that, pretending that it's just 
one large string.  In practice, these two strings are the two halves of the gap 
buffer.  Correct?

The bottom line is that, even though it should be relatively easy to change 
string-match-p to use onigmo, changing re-search-forward sounds tricky.  Did I 
miss something?  Was there a concrete plan for merging with glibc, or for using 
an automaton-based matching strategy?  The advantages of switching could be 
support for a more common syntax, more flexible regular expressions (named 
groups, assertions, …), better performance, and better handling of special 
cases.

Cheers,
Clément.

smime.p7s
Description: S/MIME Cryptographic Signature

[Prev in Thread]

Current Thread

[Next in Thread]

Regular expression libraries, Clément Pit--Claudel <=
- Re: Regular expression libraries, Eli Zaretskii, 2016/12/15
- Re: Regular expression libraries, Paul Eggert, 2016/12/15
  - Re: Regular expression libraries, Andreas Schwab, 2016/12/15
    - Re: Regular expression libraries, Paul Eggert, 2016/12/16
    - Re: Regular expression libraries, Clément Pit--Claudel, 2016/12/16
    - Re: Regular expression libraries, Clément Pit--Claudel, 2016/12/16
    - Re: Regular expression libraries, Lars Ingebrigtsen, 2016/12/16
    - Re: Regular expression libraries, Clément Pit--Claudel, 2016/12/16
    - Re: Regular expression libraries, Eli Zaretskii, 2016/12/16
    - Re: Regular expression libraries, Paul Eggert, 2016/12/16

Prev by Date: Re: Skipping unexec via a big .elc file
Next by Date: Re: Regular expression libraries
Previous by thread: Re: [Emacs-diffs] master 09a66ce: Fix circular list handling in seq-mapn
Next by thread: Re: Regular expression libraries
Index(es):
- Date
- Thread