[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Char-folding: how can we implement matching multiple characters as a
Re: Char-folding: how can we implement matching multiple characters as a single "thing"?
Mon, 30 Nov 2015 17:49:35 +0100
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0
On 11/30/2015 04:54 PM, Artur Malabarba wrote:
> As you can see. The number of branches will scale badly with the
> number of input characters. Which means the number of output
> characters will be much more than 40 times the input. And even if
> the next character does NOT combine with 'x', it's hard to write an
> algorithm that can identify that and close all branches before
> continuing with the conversion.
> And that is why I disabled this code for now. With a short string
> like "endless-visit-notifications" it was return regexps longer than
> 10k characters. Which is more than Emacs handle.
Thanks for your efforts on this; it would be very convenient for "ff" or "fi"
to match their ligature counterparts.
Are the difficulties due to the sheer size of the regular expressions, or to
their complexity (too much branching?). If the size is the real issue (and if
that size mostly comes from repeated folding of individual characters), then I
can think of two options:
* Extend the C-level implementation of regular expressions to make
character-folding a capability of the C engine, allowing for succinct regexps +
* Add a new backslash to regular expressions for matching char-folded
characters (custom character categories almost already allow for this, so the
implementation would probably be similar).
These two solutions would not help with the branching, but they would allow for
smaller input to the regexp engine (again, assuming that the size issue is due
to the large brackets, not to the branching).
On the other hand, if excessive branching is the issue (either due to the size
of the regexps even with the ideas above applied, or because the engine has
trouble with all the backtracking), then maybe normalization can help: instead
of searching the actual buffer, the regexp engine could be made to search the
buffer itself, and a copy of it with all text normalized, for some "good"
definition of normalization. So looking for "iff" in "diﬃcult" would really
look for "iff" in both "diﬃcult" and "difficult". Since normalization is a
per-character operation, this does not require maintaining a second copy of the
buffer: instead, one can just feed two streams of characters to the regexp
engine: the un-normalized one, and the normalized one.
As a side note, it may also be possible to track at the buffer level whether
the whole file is ascii; if it is, then case folding can be omitted, making any
performance degradation stemming from character folding more acceptable. One
can also use finer-grained control, especially if character folding happens at
the C level in the regexp engine.
Description: OpenPGP digital signature