Re: idn.el and confusables.txt

From: Ted Zlatanov
Subject: Re: idn.el and confusables.txt
Date: Sat, 14 May 2011 20:22:44 -0500
User-agent: Gnus/5.110018 (No Gnus v0.18) Emacs/24.0.50 (gnu/linux)

On Sat, 14 May 2011 23:59:22 +0300 Eli Zaretskii <address@hidden> wrote: 

>> Let's say C1, C2, and C3 are confusables mapped to C1.  Then the mapping
>> is C1 -> (C2, C3); C2 -> C1; and C3 -> C1.
>> The algorithm is "if a character maps to an atom it's confusable with
>> it, if it maps to a list the whole list is confusable to this
>> character."

EZ> Should it be a list or a string?  How would you use this mapping?

It could be any type of sequence, I guess.  Strings are more compact but
for small amounts of data (typically 1-3 characters) I'm not sure if
that matters.  For 1 character in particular I'm pretty sure it's more
efficient to store the character directly than any sequence.
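To make the C1/C2/C3 scheme concrete, here is a minimal sketch in Python (not Emacs Lisp; a dict stands in for the char-table).  The names `build_confusable_map` and `confusable_with` are made up for illustration, and "C1", "C2", "C3" are the placeholders from above, not real characters.

```python
def build_confusable_map(groups):
    """Build the mapping from (target, [other characters]) pairs.

    The target maps to the list of its confusables; every other
    character maps back to the target as a bare atom.
    """
    table = {}
    for target, others in groups:
        table[target] = list(others)      # list: all of these are confusable with target
        for ch in others:
            table[ch] = target            # atom: ch is confusable with target
    return table


def confusable_with(table, ch):
    """Return the characters CH is confusable with, always as a list."""
    value = table.get(ch)
    if value is None:
        return []
    if isinstance(value, str):            # atom case
        return [value]
    return list(value)                    # sequence case


table = build_confusable_map([("C1", ["C2", "C3"])])
print(confusable_with(table, "C2"))
print(confusable_with(table, "C1"))
```

Storing the atom directly for the common one-character case avoids allocating a sequence per entry, which is the trade-off described above.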

markchars.el would use it as follows: look at all the characters of a
word.  If any are of a different script S2 from the majority script S1,
highlight them (we do this now with `markchars-face-confusable').

New functionality: now if any of the S2 characters are multi-script
confusables that map to a character in the majority script S1, highlight
them specially with the new face
`markchars-face-confusable-multi-script' and give them a tooltip to say
they are confusable with a particular character.

New functionality: if any of the word characters, regardless of script,
are confusables of the single-script type, highlight them with
`markchars-face-confusable'.  But see below about normalization.
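The three highlighting rules above could be sketched roughly as follows, again in Python rather than Emacs Lisp.  Everything here is an assumption for illustration: `script_of` uses the first word of the Unicode character name as a crude stand-in for the real script property, `classify_word` is a made-up name, and the two dicts stand in for the confusables tables.  The sketch returns face names instead of applying them.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    # Crude proxy (assumption): first word of the Unicode name,
    # e.g. "LATIN", "CYRILLIC", "GREEK".  Not the real Script property.
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def classify_word(word, multi_script, single_script):
    """Return a list of (character, face-name-or-None) for WORD."""
    if not word:
        return []
    scripts = [script_of(ch) for ch in word]
    majority = Counter(scripts).most_common(1)[0][0]
    result = []
    for ch, script in zip(word, scripts):
        face = None
        # Existing behavior: character from a minority script.
        if script != majority:
            face = "markchars-face-confusable"
        # Single-script confusables are flagged regardless of script.
        if ch in single_script:
            face = "markchars-face-confusable"
        # Minority-script character whose multi-script mapping lands
        # on a single character of the majority script.
        target = multi_script.get(ch)
        if (script != majority and isinstance(target, str)
                and len(target) == 1 and script_of(target) == majority):
            face = "markchars-face-confusable-multi-script"
        result.append((ch, face))
    return result


# "paypal" with CYRILLIC SMALL LETTER A (U+0430) in second position.
marks = classify_word("p\u0430ypal", {"\u0430": "a"}, {})
```

The multi-script check runs last so that its face takes precedence, and the mapped-to character is available at that point for the tooltip text.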

EZ> The RHS of a mapping can be several characters, in which case there's
EZ> no reverse mapping and no "confusables mapped to a character", I
EZ> think.

OK.  I was thinking of using the transitivity information but that's not
very useful so never mind.

>> In addition to the character mapping we also need a confusable data
>> type, which can be SL/SA (single-script) or ML/MA (mixed-script).

EZ> What would be a possible use of that?

Single-script confusables can be an accident and are usually due to
combining, e.g. parenthesized numbers:

2485 ;  0028 006C 0038 0029 ;   SL      #* ( ⒅ → (l8) ) PARENTHESIZED NUMBER 

...although there are many cases where that's not true:

0399 ;  0031 ;  SA      # ( Ι → 1 ) GREEK CAPITAL LETTER IOTA → DIGIT ONE       # →l→
0417 ;  0033 ;  SA      # ( З → 3 ) CYRILLIC CAPITAL LETTER ZE → DIGIT THREE    

As a general rule I'd say that if the mapping is to a single character
with the SL/SA single-script property, chances are it's a true
confusable.  Otherwise it could be legitimate and we'd need to convert
the string to a normalized form, which is probably slow (do you know?)

Mixed-script confusables are more dangerous because they look exactly
like the other character and are less likely to be an accident, e.g.

FF01 ;  0021 ;  ML      #* ( ！ → ! ) FULLWIDTH EXCLAMATION MARK → EXCLAMATION MARK      # →ǃ→
0430 ;  0061 ;  ML      # ( а → a ) CYRILLIC SMALL LETTER A → LATIN SMALL LETTER A      #

so I would make them more noticeable and would skip any normalization.
Thus my new functionality proposals above.

There are also whole-script confusables, e.g. "scope" in Latin and
"scope" in Cyrillic (example from http://unicode.org/reports/tr39/) but
I think those are covered by the rules above already and don't merit
special treatment.

Finally, confusables.txt has transitivity mappings that explain how the
mapping was derived.  I don't think that's particularly useful for
markchars.el.  I can't think of any other uses for the confusables.txt
data beyond those listed above.

Based on all this, I think it's best to make the confusables char-table
values atoms or sequences (strings or lists) but split them into two
char-tables for the single-script and multi-script mappings.
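As a rough illustration of that split, the following Python sketch parses rows in the format quoted above into two tables keyed by source character.  It is only a sketch under assumptions: `parse_confusables` is a made-up name, plain dicts stand in for char-tables, the trailing comment (including the transitivity note) is discarded, and a later row of the same class simply overwrites an earlier one for the same source character.

```python
def parse_confusables(lines):
    """Split confusables.txt rows into (single_script, multi_script) dicts."""
    single, multi = {}, {}
    for line in lines:
        data = line.split("#", 1)[0].strip()    # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 3:
            continue
        source = chr(int(fields[0], 16))
        # The right-hand side may be several code points.
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        kind = fields[2]
        if kind in ("SL", "SA"):
            single[source] = target
        elif kind in ("ML", "MA"):
            multi[source] = target
    return single, multi


sample = [
    "2485 ;  0028 006C 0038 0029 ;   SL      #* ( ⒅ → (l8) ) PARENTHESIZED NUMBER",
    "0399 ;  0031 ;  SA      # ( Ι → 1 ) GREEK CAPITAL LETTER IOTA → DIGIT ONE",
    "0430 ;  0061 ;  ML      # ( а → a ) CYRILLIC SMALL LETTER A → LATIN SMALL LETTER A",
]
single, multi = parse_confusables(sample)
```

In Python a one-character target is just a string of length one; in the char-table version it would be stored as a bare character, with a string or list only for the multi-character case.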

