Re: idn.el and confusables.txt

From: Eli Zaretskii
Subject: Re: idn.el and confusables.txt
Date: Sun, 15 May 2011 01:56:02 -0400

> From: Ted Zlatanov <address@hidden>
> Date: Sat, 14 May 2011 20:22:44 -0500
>
> EZ> Should it be a list or a string?  How would you use this mapping?
>
> It could be any type of sequence, I guess.  Strings are more compact but
> for small amounts of data (typically 1-3 characters) I'm not sure if
> that matters.  For 1 character in particular I'm pretty sure it's more
> efficient to store the character directly than any sequence.
>
> markchars.el would use it as follows: look at all the characters of a
> word.  If any are of a different script S2 from the majority script S1,
> highlight them (we do this now with `markchars-face-confusable').
>
> New functionality: now if any of the S2 characters are multi-script
> confusables that map to a character in the majority script S1, highlight
> them specially with the new variable
> `markchars-face-confusable-multi-script' and give them a tooltip to say
> they are confusable with a particular character.
>
> New functionality: if any of the word characters, regardless of script,
> are confusables of the single-script type, highlight them with
> `markchars-face-confusable'.  But see below about normalization.

All of these examine portions of a buffer ("words") to see whether they
match some string or regexp.  So I think storing strings in the
char-table will be more convenient, because you could then use
looking-at, string=, string-match, etc. on the values directly.
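
To illustrate why string values are convenient, here is a minimal sketch
in Python rather than Emacs Lisp, with a plain dict standing in for the
char-table.  The table contents and the helper names (map_confusables,
confusable_with, target_regexp) are assumptions made up for this example,
not part of idn.el or markchars.el; the comparisons play the role that
string= and string-match would play in Emacs.

```python
import re

# Hypothetical excerpt of a confusables mapping: each character maps to
# the string it can be mistaken for.  A dict stands in for a char-table.
CONFUSABLES = {
    "\u0430": "a",   # CYRILLIC SMALL LETTER A
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE
    "\u01c9": "lj",  # LATIN SMALL LETTER LJ: one character, two-character target
}


def map_confusables(word):
    """Replace every character of WORD by its mapped string, if it has one."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in word)


def confusable_with(word, target):
    """Whole-string comparison on the mapped form (the string= analogue)."""
    return word != target and map_confusables(word) == target


def target_regexp(ch):
    """Regexp matching the string CH is confusable with (cf. string-match)."""
    return re.compile(re.escape(CONFUSABLES[ch]))
```

Because every value is already a string, a one-character target and a
multi-character target go through the same comparison code, with no
special case for bare characters.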

> As a general rule I'd say that if the mapping is to a single character
> with the SL/SA single-script property, chances are it's a true
> confusable.  Otherwise it could be legitimate and we'd need to convert
> the string to a normalized form, which is probably slow (do you know?)

What do you mean by "normalized form"?

> Based on all this, I think it's best to make the confusables char-table
> values atoms or sequences (strings or lists) but split them into two
> char-tables for the single-script and multi-script mappings.

If we were to implement the full IDNA protocol, would the above be
enough?  Or will we need additional information?
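
For reference, the two-table scheme from the quoted text could be
sketched roughly as follows.  This is Python, not Emacs Lisp, and it rests
on several assumptions: the two dicts are made-up excerpts standing in for
the proposed char-tables, script_of is a crude proxy that takes the first
word of the Unicode character name instead of the real script property,
and classify is a hypothetical helper.  Only the two face names come from
the quoted message.

```python
import unicodedata
from collections import Counter

# Hypothetical excerpts; real data would be generated from confusables.txt.
SINGLE_SCRIPT = {"\u0131": "i"}                 # LATIN SMALL LETTER DOTLESS I
MULTI_SCRIPT = {"\u0430": "a", "\u03bf": "o"}   # Cyrillic a, Greek omicron


def script_of(ch):
    """Crude script proxy: first word of the Unicode name (an assumption)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def classify(word):
    """Return (char, face-name, target-or-None) for each suspicious character."""
    if not word:
        return []
    scripts = [script_of(ch) for ch in word]
    majority = Counter(scripts).most_common(1)[0][0]   # script S1
    marks = []
    for ch, script in zip(word, scripts):
        target = MULTI_SCRIPT.get(ch)
        if (script != majority and target is not None
                and all(script_of(t) == majority for t in target)):
            # S2 character that maps to something in the majority script.
            marks.append((ch, "markchars-face-confusable-multi-script", target))
        elif ch in SINGLE_SCRIPT:
            # Single-script confusable, regardless of script.
            marks.append((ch, "markchars-face-confusable", SINGLE_SCRIPT[ch]))
        elif script != majority:
            # Plain minority-script character, no known mapping.
            marks.append((ch, "markchars-face-confusable", None))
    return marks
```

The third element of each tuple is what a tooltip would report as the
character it can be confused with; it is None when the character is
merely from a different script.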
