[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Fixing ill-conditioned regular expressions. Proof of concept.

From: Alan Mackenzie
Subject: Fixing ill-conditioned regular expressions. Proof of concept.
Date: Mon, 23 Feb 2015 18:12:05 +0000
User-agent: Mutt/1.5.22 (2013-10-16)

Hello, Emacs.

Please refer to Martin Rudalics's bug #19846.  Briefly, in a C Mode
buffer, with auto-fill-mode enabled, auto-repeating the spacebar at the
beginning of a comment line caused Emacs to freeze for a long time (up to
several minutes).

The reason for the freeze was an ill-conditioned regexp being constructed
by CC Mode and forward-paragraph, and this being used in regexp searches.
Here, ill-conditioned means the regexp had several subexpressions
concatenated, each of which matches arbitrary amounts of whitespace.
This causes the backtracking algorithm in the regexp engine to go crazy.

One solution would be for hackers to write better regexps.  But since the
pertinent one here is constructed partly from a user configuration, this
is difficult to enforce.  It's also difficult when a core function like
forward-paragraph uses strings supplied by arbitrary packages.

Another solution is to fix these ill-conditioned regexps, which is the
approach taken here.  The supplied file fix-re.el contains algorithms
which convert the regexp to an internal tree form, analyse and, if
necessary, correct the tree, then convert back to a regexp string.

There are two entry points to the code:
(i) fix-re, which given a regexp returns either the corrected form or
  the same regex; it caches supplied regexps, so that fixing/non-fixing
  only need be done once for each supplied regexp.
(ii) fix-re-test, a slightly lower level function bypasses the cache and
  simply tests the regexp, returning the corrected version or nil.

The runtime of fix-re is minimal - around 2ms, as measured by elp.

To use fix-re.el, byte-compile and load it.  Apply the following patch to

diff --git a/lisp/textmodes/paragraphs.el b/lisp/textmodes/paragraphs.el
index 8bcc71e..45aa95d 100644
--- a/lisp/textmodes/paragraphs.el
+++ b/lisp/textmodes/paragraphs.el
@@ -247,7 +247,8 @@ Returns the count of paragraphs left to move."
                      fill-prefix-regexp "[ \t]*$")
         ;; This is used for searching.
-        (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
+        (sp-parstart (fix-re
+         (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)")))
         start found-start)
     (while (and (< arg 0) (not (bobp)))
       (if (and (not (looking-at parsep))

.  With this in place, the freezing observed in bug #19846 no longer

The particular ill-conditioned sp-parstart which caused bug #19846 was

                           2             1    1            3             1
    ^[ \t]*\(?:[ \t]*\(//+\|\**\)[ \t]*$\|^^L\|[ \t]*\(//+\|\**\)[ \t]*$\|^^L\)
            1         2         2                     3         3             1
                              ^                               ^

.  Note the two expressions matching arbitrary amounts of WS sandwiching
\(//+\|\**), which also matches the empty string.  Note in particular,
the marked character.  (N.B. Note also the repetition which results from
the way the regexp was constructed, which will double the execution time
of the RE engine here.)

Here is this regexp after processing by fix-re:

                         3                1    1                           1
    ^[ \t]*\(?:\(?:\(//+\|\*+\)[ \t]*\)?$\|^^L\|\(?:\(//+\|\*+\)[ 
            1   2   3       ^ 3       2          4   5      ^  5       4        

.  Note how the subexpression 3 has become @dfn{de-emptified} - it no
longer matches the empty string, but matches everything else that the
original subexp did (and nothing more).

Abstractly, R*(R*ER*).... (where E is the subexp matching the empty
string) has been converted to R*((address@hidden)?....), where E@ is the
de-emptification of E.

fix-re.el is currently a proof of concept, and is in a somewhat rough
state.  Comments and doc strings need attention, and the coding style is
not consistent throughout the file.  An additional transformation to
remove repeated alternatives from within \(..\) is also wanted.  But most
importantly, \{..\} expressions aren't handled at all.

I suggest that fix-re.el get incorporated into Emacs, after being tidied


Alan Mackenzie (Nuremberg, Germany).

Attachment: fix-re.el
Description: Text document

reply via email to

[Prev in Thread] Current Thread [Next in Thread]