Re: highlighting non-ASCII characters

From: Ted Zlatanov
Subject: Re: highlighting non-ASCII characters
Date: Wed, 24 Mar 2010 05:05:35 -0500
User-agent: Gnus/5.110011 (No Gnus v0.11) Emacs/24.0.50 (gnu/linux)

On Tue, 23 Mar 2010 22:47:37 -0400 Stefan Monnier <address@hidden> wrote: 

>> show-nonascii-characters: t, 'majority-paragraph, majority-line,
>> 'minority-line, 'minority-paragraph,
>> 'suspicious, a function, or nil (default)

SM> The name is wrong, I think: I'd probably want to highlight ASCII chars
SM> that are out of place, just as with non-ASCII chars.

Although Unicode calls them "confusable" I think that's a terrible name.
So how about show-out-of-place-glyphs as an alist, with 'homoglyphs as
a key option (see http://en.wikipedia.org/wiki/Homoglyph)?

show-out-of-place-glyphs: alist; keys can be 'ascii, 'nonascii,
or 'homoglyphs.  Maybe we can also allow a general regex.

Values can be 'always, 'majority-paragraph, 'majority-line,
'minority-line, 'minority-paragraph, 'suspicious (with the same rules I
proposed earlier).  A function should also be possible.  Optional second
value is a face, defaulting to `out-of-place-glyph'.

That lets us map an interesting class of characters to a heuristic that
determines whether they are out of place.

So Stefan might have (and this could be the Emacs default)

(setq show-out-of-place-glyphs '((homoglyphs suspicious)))

but I would have

(setq show-out-of-place-glyphs '((nonascii always face1)))

which includes Stefan's setting.
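
To make the proposed shape concrete, here is a minimal sketch of how such an alist could be consulted. It assumes each entry has the form (KEY HEURISTIC [FACE]); neither the variable nor the helper `my-out-of-place-rule' exists in Emacs, both are hypothetical.

```elisp
;; Hypothetical sketch: nothing here is existing Emacs API.
(defvar show-out-of-place-glyphs '((homoglyphs suspicious))
  "Alist of (KEY HEURISTIC [FACE]) entries.
KEY is a glyph class such as `ascii', `nonascii' or `homoglyphs'.")

(defun my-out-of-place-rule (class)
  "Return (HEURISTIC . FACE) configured for CLASS, or nil if none.
FACE falls back to the proposed default `out-of-place-glyph'."
  (let ((entry (assq class show-out-of-place-glyphs)))
    (when entry
      (cons (nth 1 entry)
            (or (nth 2 entry) 'out-of-place-glyph)))))
```

With the default above, (my-out-of-place-rule 'homoglyphs) gives the `suspicious' heuristic paired with the default face, and an unconfigured class gives nil.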

SM> Also, I'm not sure if proportion compared to total text (or line) is
SM> a good metric to decide whether it's suspicious.  I don't have much
SM> better to suggest, tho.

I based it on what I would find useful.  I think the majority of people
will want 'suspicious and let Emacs choose a default.  So maybe the
{majority,minority}-* options are superfluous.

On Wed, 24 Mar 2010 06:20:51 +0200 Eli Zaretskii <address@hidden> wrote: 

EZ> If we go for such a metric, it would need to be augmented by a
EZ> database of words where a small number of such characters is
EZ> ``normal'', not to be highlighted.  This is for words like naïve.
EZ> Otherwise the feature will be an annoyance.

The ï in naïve is in Latin-1 (the "extended ASCII" range), which would
probably be included in the ASCII definition above, although it
certainly has homoglyphs in upper-range Unicode (I revised the proposal
to distinguish between highlighting homoglyphs and non-ASCII).  I think
regular English doesn't have many common words that would fall outside
that range.

On Wed, 24 Mar 2010 13:14:13 +0800 Jason Rumney <address@hidden> wrote: 

JR> It's also dependent on which characters they are - Cyrillic, Han,
JR> Greek, Hebrew etc should be expected to appear in long runs, perhaps
JR> with runs of ASCII and/or other characters interleaved.  Latin-1 on
JR> the other hand would normally appear individually or in very short
JR> runs mixed in with ASCII.

Agreed, and that can be fine-tuned.
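
One possible reading of the run-length idea, as a rough sketch only: treat characters above U+00FF as "foreign" and report only short runs of them, so long Cyrillic or Greek passages and isolated Latin-1 letters are both left alone. The function name, the U+00FF cutoff, and the default threshold of 2 are all assumptions made for illustration.

```elisp
;; Hypothetical heuristic sketch, not a proposed final rule.
(defun my-short-foreign-runs (string &optional max-len)
  "Return (START . END) pairs for short runs of chars above U+00FF.
A run is reported only if it is at most MAX-LEN chars long (default 2).
END is exclusive."
  (let ((max-len (or max-len 2))
        (len (length string))
        (i 0)
        (runs nil))
    (while (< i len)
      (if (> (aref string i) #xFF)
          ;; Scan to the end of this run of foreign characters.
          (let ((start i))
            (while (and (< i len) (> (aref string i) #xFF))
              (setq i (1+ i)))
            (when (<= (- i start) max-len)
              (push (cons start i) runs)))
        (setq i (1+ i))))
    (nreverse runs)))
```

A lone Cyrillic letter inside an otherwise ASCII word is reported, while a whole Cyrillic word or the ï in naïve is not.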

JR> There is no single heuristic that can be used to identify "suspicious"
JR> characters.

So we'll provide several.  I'd rather have something useful than try to
make it perfect.

On Tue, 23 Mar 2010 19:09:18 -0700 "Drew Adams" <address@hidden> wrote: 

>> What I'm saying is that there are two issues: non-ASCII chars in
>> general (which I personally don't want to display in any special
>> manner: they're just as normal as ASCII chars), and then there are
>> "chars that are out of place or that may not be what they look like",
>> such as the weird "K" in the other message's "OK" (which to me, is
>> similar to the NBSP char in that it is meant to be displayed in the
>> same way as some other char, so we want to call the attention of the
>> user to the difference).

I hope you'll go along with "homoglyphs" as I propose, I think that's
what you mean :)

On Wed, 24 Mar 2010 14:00:47 +0900 "Stephen J. Turnbull" <address@hidden> wrote:

SJT> There were long threads on Python-dev about this with respect to the
SJT> PEPs implementing Unicode.  The bottom line was basically that the
SJT> recommendations of the Unicode Security Considerations UTR #36 should
SJT> be followed with respect to "characters that may not be what they look
SJT> like".

This is relevant, thanks for the pointer.  The report links to data
that can also be used to build a table of homoglyphs.
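
As a sketch of what a homoglyph table could look like once built: a hash table from a confusable character to the ASCII character it resembles. The four entries below are hand-picked, well-known lookalikes chosen for illustration, not output from any Unicode data file, and both names are hypothetical.

```elisp
;; Hypothetical sketch with a tiny hand-written sample table.
(defvar my-homoglyph-table
  (let ((table (make-hash-table :test 'eql)))
    (puthash #x0430 ?a table)   ; CYRILLIC SMALL LETTER A
    (puthash #x043E ?o table)   ; CYRILLIC SMALL LETTER O
    (puthash #x0391 ?A table)   ; GREEK CAPITAL LETTER ALPHA
    (puthash #x212A ?K table)   ; KELVIN SIGN
    table)
  "Map from a confusable character to its ASCII lookalike.")

(defun my-find-homoglyphs (string)
  "Return a list of (INDEX . ASCII-LOOKALIKE) for table hits in STRING."
  (let ((i 0)
        (hits nil))
    (while (< i (length string))
      (let ((proto (gethash (aref string i) my-homoglyph-table)))
        (when proto
          (push (cons i proto) hits)))
      (setq i (1+ i)))
    (nreverse hits)))
```

For an "OK" whose second character is the Kelvin sign, this reports index 1 together with the plain ASCII K it imitates; a real table would be generated from published confusables data rather than typed in by hand.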

