[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#192: regexp does not work as documented
From: |
Thomas Lord |
Subject: |
bug#192: regexp does not work as documented |
Date: |
Sun, 11 May 2008 20:30:00 -0700 |
User-agent: |
Thunderbird 1.5.0.5 (X11/20060808) |
Stefan Monnier wrote:
I have most of the DFA construction code written, but I may take you up
on that anyway. BTW, regarding the "very expensive in the worst case",
how common is this worst case in real life?
Talking about general purpose use of an engine, as opposed to something
more narrow like "likely font-lock expressions":
It's common enough in "real life" (in my experience) to be exactly the
minimal amount required to make it annoying.
Let me give you some rules of thumb to describe what I mean but I'll
refrain from a long treatise explaining the theory that supports these.
If you have questions I can "unpack" this on list or off.
First, be careful to make a clear distinction between (let's dub the
distinction) "regexps vs. regular expressions". "Regular expressions"
correspond to formally regular languages. Regular expressions can
always be converted to a DFA. "regexps" are what most people use
in most situations. regexps include things like features to extract the
positions that match a sub-pattern. regexps do *not* correspond to
regular languages and *can not* always be converted to DFA. DFA
conversion can help with regexp matching, but it can't solve it completely.
Moreover, you can make pathological regexps (very slow to match) for
every regexp I know of. Henry Spencer (I've heard) asserts that regexps
are NP complete or something around there, though I haven't seen the
proof. (Rx is a hybrid regular expression and regexp engine based on an
on-line (caching, incremental) DFA conversion suggested in the "Dragon"
compiler book and originating from (as I recall) an experiment by
Thompson and one of the other Bell Labs guys. One of whoever it was
privately mentioned that they abandoned it themselves because it was taking
too much code to implement, or something like that.)
That aside, regular expressions are probably plenty for font-lock
functionality,
so let's just talk about those. On-line DFA conversion for those is
likely
practical by multiple means.
The size of a DFA is, worst-case, exponential in the size of the regular
expression.
This is the source of pathological cases that can thwart any DFA
engine. (In exponential
cases, Rx keeps running but more like an NFA, taking a corresponding
performance hit.)
For a time, I was pushing hard on Rx, trying to use it for absolutely
everything.
To make up new problems to solve as in "Ok, let's suppose I can run
X-Kbyte long
regular expressions with lots of | operators, stars, etc. What can I
apply that to?"
One example is lexing for real-world computer languages.
[Aside: you can also use the same engine as the core of a shift-reduce
parser, of course -- and I
did that a long time ago with Guile, to useful effect.]
Almost always, the expressions were such that Rx would have no problem
and give
very pleasing results. However, there were definitely times (for huge
regular expressions
and smaller ones) when things would just absolutely crawl. They happen
"just often
enough" to be an annoyance.
Now, every case of that annoyance that I found (in real-life
applications) had a solution.
I could think for a while on *why* the regular expression I wrote was
blowing up and
then think of a different approach that eliminated the problem. Most
often that
meant not a different but equivalent regular expression -- it meant
going one level up and
changing the way the caller used regular expressions. The caller would
retain the same
functionality but different demands would be made on the regular
expression (or regexp) engine.
And that makes it tricky to drop *blindly* into Emacs. To solve the
unavoidable annoying cases
you really had to know how regular expressions worked at a deep level.
To diagnose
and work around the pathological cases took some expertise.
As a general rule, for something "general purpose" like a libc
implementation or the
default regexp functions in Emacs lisp, there is something to be said
for using NFA matchers.
It's perverse why:
Vast swaths of things that a DFA matcher can do very quickly an NFA
matcher can not.
It's reliably slow. Therefore, people not prepared to think hard about
regexps tend to use
an NFA matcher in pretty limited ways. A powerful regular expression
engine could
simplify their code, at the cost of risking finding a "hard case" -- but
there's no issue since
people quickly give up and don't try to push the match engine that hard.
More concisely:
If you use a DFA matcher You Better Know What You're Doing but also
Most of the Time You Won't Need To So It's Very Convenient.
If you use an NFA matcher You Can Get Away With Not Really Knowing What
You're Doing but also You Won't Be Trying Anything Fancy, Perhaps Losing
Out.
One approach to toss into the mix, this one suggested by Per Bothner
years ago,
is to consider *offline* DFA conversion (a la 'lex(1)'). The
advantage of offline
(batch) conversion is that you can burn a lot of cycles on DFA
minimization and,
if your offline converter terminates, you've got a reliably linear
matcher. The
disadvantages for *many* uses of regular expressions in Emacs should be
pretty obvious.
For something like font-lock, where the regular expressions don't change
that often,
that might be a good approach -- precompile a minimal DFA and then add
support
for "regular expression continuations" when using those tables.
-t
- bug#192: regexp does not work as documented, (continued)
- bug#192: regexp does not work as documented, David Koppelman, 2008/05/11
- bug#192: regexp does not work as documented, Stefan Monnier, 2008/05/11
- bug#192: regexp does not work as documented, David Koppelman, 2008/05/11
- bug#192: regexp does not work as documented, Stefan Monnier, 2008/05/11
- bug#192: regexp does not work as documented, David Koppelman, 2008/05/12
- bug#192: regexp does not work as documented, Stefan Monnier, 2008/05/12
- bug#192: regexp does not work as documented, David Koppelman, 2008/05/12
- bug#192: regexp does not work as documented, Stefan Monnier, 2008/05/11
- bug#192: regexp does not work as documented, Thomas Lord, 2008/05/11
- bug#192: regexp does not work as documented, Stefan Monnier, 2008/05/11
- bug#192: regexp does not work as documented,
Thomas Lord <=
- bug#192: regexp does not work as documented, Stefan Monnier, 2008/05/12
- bug#192: regexp does not work as documented, Thomas Lord, 2008/05/12
- bug#192: regexp does not work as documented, tomas, 2008/05/12