[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Char-folding: how can we implement matching multiple characters as a

From: Clément Pit--Claudel
Subject: Re: Char-folding: how can we implement matching multiple characters as a single "thing"?
Date: Mon, 30 Nov 2015 17:49:35 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 11/30/2015 04:54 PM, Artur Malabarba wrote:
> As you can see. The number of branches will scale badly with the 
> number of input characters. Which means the number of output 
> characters will be much more than 40 times the input. And even if
> the next character does NOT combine with 'x', it's hard to write an 
> algorithm that can identify that and close all branches before 
> continuing with the conversion.
> And that is why I disabled this code for now. With a short string 
> like "endless-visit-notifications" it was return regexps longer than 
> 10k characters. Which is more than Emacs handle.

Hi Arthur,

Thanks for your efforts on this; it would be very convenient for "ff" or "fi" 
to match their ligature counterparts.

Are the difficulties due to the sheer size of the regular expressions, or to 
their complexity (too much branching?). If the size is the real issue (and if 
that size mostly comes from repeated folding of individual characters), then I 
can think of two options:

* Extend the C-level implementation of regular expressions to make 
character-folding a capability of the C engine, allowing for succinct regexps + 
a flag.
* Add a new backslash to regular expressions for matching char-folded 
characters (custom character categories almost already allow for this, so the 
implementation would probably be similar).

These two solutions would not help with the branching, but they would allow for 
smaller input to the regexp engine (again, assuming that the size issue is due 
to the large brackets, not to the branching).

On the other hand, if excessive branching is the issue (either due to the size 
of the regexps even with the ideas above applied, or because the engine has 
trouble with all the backtracking), then maybe normalization can help: instead 
of searching the actual buffer, the regexp engine could be made to search the 
buffer itself, and a copy of it with all text normalized, for some "good" 
definition of normalization. So looking for "iff" in "difficult" would really 
look for "iff" in both "difficult" and "difficult". Since normalization is a 
per-character operation, this does not require maintaining a second copy of the 
buffer: instead, one can just feed two streams of characters to the regexp 
engine: the un-normalized one, and the normalized one.

As a side note, it may also be possible to track at the buffer level whether 
the whole file is ascii; if it is, then case folding can be omitted, making any 
performance degradation stemming from character folding more acceptable. One 
can also use finer-grained control, especially if character folding happens at 
the C level in the regexp engine.


Attachment: signature.asc
Description: OpenPGP digital signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]