Re: On language-dependent defaults for character-folding

I think your message illustrates an opinion that is not only mine: I am not against the idea of character folding. If I were, I'd simply ignore this discussion and turn the feature off. What I want, and by the looks of things other people want too, is to actually have this feature. I just don't want it to be broken, and today it is broken because it's been implemented based on incorrect assumptions.

On 20 Feb 2016 14:32, "Lars Ingebrigtsen" <address@hidden> wrote:

> It seems to me that we're considering using the Unicode decomposition
> rules for "variant detection" because it's what we have. But this
> doesn't allow people to say `C-s l' to find ł or `C-s o' to find ø, and
> this would obviously be something that many people would find helpful.

The Unicode collation charts do place ø in the "o" category. Eli said in an earlier message that the collation charts were consulted, but when I test it, that doesn't seem to be the case.

The Unicode character collation charts are the best generic solution that Unicode gives us.

The proposal you put forward below seems very much like what I proposed earlier: have the locale-dependent rules determine any exceptions, and then fall back to a generic method.
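To make the shape of that concrete, here is a minimal Python sketch (not Emacs code, and not anyone's actual implementation). The table `LOCALE_EXCEPTIONS`, its entries, and the function names are all hypothetical, chosen only for illustration, and the generic fallback is deliberately left as a pluggable placeholder:

```python
import unicodedata
from typing import Callable, Dict

# Hypothetical per-locale exception tables; the entries are illustrative only.
LOCALE_EXCEPTIONS: Dict[str, Dict[str, str]] = {
    # Swedish: these are distinct letters, so they fold to themselves.
    "sv": {"\u00e5": "\u00e5", "\u00e4": "\u00e4", "\u00f6": "\u00f6"},  # å ä ö
    # English: fold stroked letters to the key a user can actually type.
    "en": {"\u00f8": "o", "\u0142": "l"},  # ø ł
}

def placeholder_generic_fold(ch: str) -> str:
    """Stand-in for whatever generic method ends up being chosen."""
    return unicodedata.normalize("NFD", ch)[0]

def fold_char(ch: str, locale: str,
              generic: Callable[[str], str] = placeholder_generic_fold) -> str:
    """Apply the locale's exception if there is one, else the generic method."""
    exceptions = LOCALE_EXCEPTIONS.get(locale, {})
    if ch in exceptions:
        return exceptions[ch]
    return generic(ch)

def fold_string(text: str, locale: str) -> str:
    # Compose first so that precomposed keys in the exception table match.
    text = unicodedata.normalize("NFC", text)
    return "".join(fold_char(ch, locale) for ch in text)
```

With these made-up tables, "smör" stays "smör" under "sv" but folds to "smor" under "en", while "Hermès" folds to "Hermes" in both because no exception applies.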

The question is what that generic method should be. The current trick of decomposing and using the first character of the decomposition is not good and breaks down very quickly. Clearly the collation charts should be consulted instead, but this is not enough. I could spend quite some time discussing all the issues that I can think of (to get an idea, look up how Korean and Devanagari work, as well as the concept of "grapheme clusters").
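As a rough illustration of where the decomposition trick stops working, here is a small Python sketch of the technique as I described it above (the function name is made up, and this is not the Emacs implementation). It relies only on the canonical decomposition data in Python's `unicodedata` module:

```python
import unicodedata

def fold_by_decomposition(ch: str) -> str:
    """Return the first code point of the canonical decomposition of ch."""
    return unicodedata.normalize("NFD", ch)[0]

# é (U+00E9) decomposes to e + combining acute, so it folds as hoped.
print(fold_by_decomposition("\u00e9"))  # e

# ø (U+00F8) and ł (U+0142) have no decomposition, so nothing happens.
print(fold_by_decomposition("\u00f8") == "\u00f8")  # True
print(fold_by_decomposition("\u0142") == "\u0142")  # True

# A Hangul syllable (U+D55C) decomposes into jamo; taking the first one
# yields the leading consonant U+1112, which is not a useful "base letter".
print(hex(ord(fold_by_decomposition("\ud55c"))))  # 0x1112
```

So the same rule that gives a sensible answer for é does nothing for ø and ł, and gives a questionable answer for Korean.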

> So the Unicode decomposition rules only get us halfway there. On the
> other hand, they go too far for other users, who absolutely do not want
> `C-s o' to find ø, but would be really glad if `C-s hermes' would find
> "Hermés" (or is it "Hermès"? I can't even type
> So: How many characters are we really talking about? Unicode is big and
> scary, but this only applies to alphabetical scripts, right? That is,
> all the Latin-like scripts, and... possibly Greek/Hebrew/Cyrillic? I
> don't know?

Cyrillic has the same issues. Also, most of the accented characters in Cyrillic are historical and not used today. Therefore having this feature in Cyrillic would most definitely be useful.

> But if we only consider the Latin scripts for a moment, there aren't
> more than a few hundred Unicode points that we care about. Basically
> all the old iso-8859-foos from around Europe. And what we want is a way
> for people with normal keyboards (they have a-z in Latin alphabet
> countries) to search for variants.

It's more than that, because it's not just single characters we're talking about but also combinations. Of course, for European languages this can be handled by comparing only the base character, but in other languages this is a much more complex issue.
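A small Python sketch of what I mean by comparing only the base character, assuming that "base character" means "whatever is left after dropping every code point in a Unicode mark category" (that definition and the function name are my own simplification):

```python
import unicodedata

def base_characters(text: str) -> str:
    """Decompose text and drop every code point whose category is a mark (Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )

# Works well for a European word: è becomes e.
print(base_characters("Herm\u00e8s"))  # Hermes

# Devanagari KA + VOWEL SIGN II (U+0915 U+0940): the vowel sign is a mark,
# so it is dropped and a different syllable remains.
print(base_characters("\u0915\u0940") == "\u0915")  # True
```

For Latin text the result is what most people would expect. For Devanagari the same rule silently throws away the vowel, which is why I don't think a single generic rule can cover every script.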

That said, I agree with you on your proposed approach.

> That bit is more than an evening, but is something that people would
> enjoy submitting exceptions to, I think.

You can count me in. :-)

> And then we just look up the locale, create the mapping when we type
> `C-s', and there we are. An awesome, very useful feature that would
> annoy nobody, and that should be on by default.

That would be amazing.

Regards,
Elias

From:	Elias Mårtenson
Subject:	Re: On language-dependent defaults for character-folding
Date:	Sat, 20 Feb 2016 17:18:27 +0800