emacs-devel

Re: Character group folding in searches


From: Artur Malabarba
Subject: Re: Character group folding in searches
Date: Fri, 6 Feb 2015 14:18:39 -0200

> The full set of "folding" transformations is described in the Unicode
> technical report UTR #30.  It was withdrawn, but its last draft is
> still enlightening.
>
> I think we should support some subset of what's described there.
>
> The way to do it IMO is to generate a set of char-tables where each
> character is mapped to its folded variant,
> one char-table for each subset of folding.

Although the attached patches only define one table for now, they all
support multiple tables (even the one that's not based on char-tables)
so the sky is the limit. This detail therefore shouldn't be an
obstacle, and we can decide later which subset of foldings we want
to provide by default.
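
To make the "one table per subset of folding" idea concrete, here is a
rough sketch. It is written in Python purely for illustration and is not
taken from the patches; the table names, their contents and the `fold`
helper are all made up.

```python
# Hypothetical folding tables, one per subset of folding.
# Each maps a character to its single-character folded variant.
FOLD_TABLES = {
    "diacritics": {"á": "a", "à": "a", "é": "e", "ñ": "n"},
    "quotes": {"“": '"', "”": '"', "‘": "'", "’": "'"},
}


def fold(text, enabled):
    """Fold TEXT through every table named in ENABLED, one char at a time."""
    merged = {}
    for name in enabled:
        merged.update(FOLD_TABLES[name])
    # Characters without an entry are kept verbatim.
    return "".join(merged.get(ch, ch) for ch in text)
```

With this shape, choosing the default subsets later only means choosing
which table names go into `enabled`.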

> A character whose folding is not a single
> character should map to a vector or a string of characters (not sure
> which one is best, we should choose the one that lends itself to the
> most efficient use).
> I think the best approach is to modify search.c to be able to handle
> folding that produces more than a single character.  I think we will
> also need search.c to support several alternative foldings for the
> same search operation.  Making these changes would be relatively easy,

It's certainly doable, but I'm not sure it's easy. The `search_buffer'
function seems pretty focused on handling one char at a time. Having a
single char suddenly turn into two might require significant changes
to the code flow.

Of course, if someone takes that up, that's great!

>> * group-folding-with-regexp-lisp.patch
>>
>> This one takes each input character and either keeps it verbatim or
>> transforms it into a regexp which matches the entire group that this
>> character represents. It is implemented in isearch.
>>
>> + It trivially handles goals 1, 2 and 3. Because regexps are quite
>> versatile, it is the only solution that handles item 3 (it allows each
>> character to match more than a single character).
>
> But the downside is that we will have to construct such regexps for
> all the foldings of all the characters we want to support.  That will
> be quite a large database, and a lot of work to construct it.

It's only a tiny bit more work than generating the case-tables that
are also under discussion. Any information available to construct the
case tables is also available for building the regexps.
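
As a rough illustration of why the extra work is small, the sketch below
derives the regexp groups from the same char-to-folded-char mapping a
table would hold. It is in Python rather than elisp, and the mapping and
the function names are hypothetical, not the ones in the patch.

```python
import re
from collections import defaultdict

# Hypothetical data: the same mapping a folding table would be built from.
FOLDED = {"á": "a", "à": "a", "â": "a", "é": "e", "è": "e"}


def build_groups(folded):
    """Invert the char -> base mapping into base -> set of variants."""
    groups = defaultdict(set)
    for ch, base in folded.items():
        groups[base].add(ch)
    return dict(groups)


def to_regexp(query, groups):
    """Keep each char verbatim, or expand it to a class matching its group."""
    parts = []
    for ch in query:
        members = groups.get(ch)
        if members:
            cls = "".join(re.escape(m) for m in sorted(members | {ch}))
            parts.append("[" + cls + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)
```

For example, `to_regexp("cafe", build_groups(FOLDED))` yields a pattern
that matches both "cafe" and "café".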


>> * group-folding-with-case-table-lisp.patch
>>
>> This patch is entirely in elisp. I've put it all inside `isearch.el'
>> for now, for the sake of simplicity, but it's not restricted to
>> isearch.
>>
>> It creates a new case-table which performs group folding by borrowing
>> the case-folding machinery, so it is very fast. Then, group folding
>> can be achieved by running the search inside a `with-group-folding'
>> macro. There's also an example implementation which turns it on for
>> isearch by default.
>>
>> + It immediately satisfies items 1, 2, 4, and 5.
>> + It is very fast.
>> - It has no simple way of achieving item 3.
>
> It could use a separate case-table for item 3, couldn't it?

Not that I can tell. You need to tell Emacs to either (1) ignore the
acute accent entirely, or (2) have the "a´" pair of characters fold
into "a". Case tables just can't do this right now AFAIK.
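
Here is a small sketch of the limitation, in Python for illustration
only. It assumes the "a´" pair means "a" followed by a combining acute
accent (U+0301); the one-to-one table stands in for a case table, and
all names are hypothetical.

```python
import unicodedata

# Hypothetical one-to-one fold table, standing in for a case table.
TABLE = str.maketrans({"á": "a"})

PRECOMPOSED = "\u00e1"   # "á" as a single code point
DECOMPOSED = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT


def fold_one_to_one(text):
    """Map each character to exactly one character, as a case table does."""
    return text.translate(TABLE)


def fold_ignoring_marks(text):
    """Option (1): decompose, then drop the combining marks entirely."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

The one-to-one fold turns PRECOMPOSED into "a" but leaves DECOMPOSED as
two characters, since no per-character mapping can make the accent
disappear. Dropping the marks handles both spellings.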

> I think we will need separate tables for different foldings anyway,
> because each use case calls for some specific folding.  In isearch,
> the user will have to specify which foldings she wants to be in
> effect.

Yes, multiple tables are fine and will be done regardless of the approach taken.

>> - If the user decides to set `group-fold-search' to t, this can break
>> existing code (a disadvantage that the lisp version above does not
>> have).
>> - It adds two extra fields to every buffer object (the boolean
>> variable and the char table).
>
> I'm not sure we need to add these tables to the buffer object.  The
> experience with using case-tables this way is not encouraging, because
> in several important cases it is not at all clear which buffer is
> relevant to the folding-match operation one needs to do.

Yes, I don't like this either. I was treading unknown waters here, so
I just tried to stay as close as possible to what case-fold-search
does.

>> Do any of these options seem good enough? Which would you all like to 
>> explore?
>> I like the second one best, but goal 3 is quite important.
>
> I think we must lift the limitation of single-character folding
> result, which means changes on the C level are inevitable.

I agree this is important. But if no one takes it up I'd rather have
single-character folding than none at all.

> I also think we need to talk a bit more about which kinds of folding
> we would like to support.

What do you mean? Which folding subsets to provide by default?


