Re: split-function
From: Tom Lord
Subject: Re: split-function
Date: Mon, 9 Apr 2001 17:07:12 -0700 (PDT)

> Within the following circumstances the split-function seems to
> run endless.
> [
> expression: /\*([^*]|([*]+[^/]))*\*/
> string: //**********************************//
> ]

Wow -- that's a reasonably realistic example of the famous problems
that make Posix regexp matchers difficult to implement correctly
and well.
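
As a rough illustration of why this input is hard, here is a small Python
sketch. It assumes the slashes in the report are literal characters of the
pattern (a C-comment matcher) rather than awk regexp delimiters. Python's
`re` is a leftmost-first backtracking matcher, not a Posix leftmost-longest
one, so the sketch only shows the expected overall match and how many ways
the loop body can carve up the run of stars; it says nothing about what
dfa.c or regex.c actually do. The helper name `count_star_splits` is made
up for this example.

```python
import re

# Assumption: the slashes are literal pattern characters, not delimiters.
PATTERN = re.compile(r"/\*([^*]|([*]+[^/]))*\*/")
SUBJECT = "//**********************************//"


def count_star_splits(n):
    """Count the ways the loop body [*]+[^/] can tile a run of n stars.

    Inside a run of stars each iteration consumes at least two characters
    (one or more stars, then one more star as the non-slash character),
    so this counts ordered splits of n into parts of size >= 2.
    """
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty run has exactly one (empty) split
    for total in range(2, n + 1):
        # The last part has size `part`; the rest must tile total - part.
        ways[total] = sum(ways[total - part] for part in range(2, total + 1))
    return ways[n]


match = PATTERN.search(SUBJECT)
# The opening and closing \* each take one star; the loop tiles the rest.
inner_stars = SUBJECT.count("*") - 2
print("overall match:", match.group(0))
print("ways to tile the inner stars:", count_star_splits(inner_stars))
```

The count grows like the Fibonacci numbers in the length of the run, so a
matcher that enumerates every tiling in order to apply the Posix rules for
the overall match and the subexpressions has a great deal of work to do on
a string this short.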

rx-posix (from regexps.com) seems to handle that expression well. It
comes with extensive Posix tests, including some that other matchers
(including GNU regex) don't pass. On the other hand, it lacks GNU
extensions. It doesn't handle Unicode, but the low-level engine
handles UTF-8 and all flavors of UTF-16, so there seems to be a finite
amount of straightforward work to get there.

I haven't tested the expression with the Tcl matcher, but would guess
that it also does well. The Tcl matcher also lacks GNU extensions.
It handles Unicode with a UTF-8 encoding; handling UTF-16 or UTF-32
variations seems to be a finite amount of straightforward work. It
has a few bugs and/or its author disagrees with me about what the
Posix spec means.

The combination of dfa.c and regex.c has a steep performance drop-off
from fast expressions to slow expressions. The two matchers mentioned
above optimize a lot of cases that dfa.c can't handle and that regex.c
handles slowly.

There's no telling how many scripts would break by switching to a
correct Posix matcher -- none, many, or something in between. There
also seems to be disagreement on precisely what the Posix spec means
-- though my interpretation (in rx-posix) is, of course, the correct
one :-)

Thomas Lord