Re: obscure new display features

From: Dave Love
Subject: Re: obscure new display features
Date: Wed, 13 Apr 2005 00:03:18 +0100
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/21.3 (gnu/linux)

Miles Bader <address@hidden> writes:

> The point of similarity is not break-vs-non-break, but that they're
> both characters which can be displayed identically to some more
> common character in "display contexts", but which should be
> distinguished visually (e.g., with an escape prefix ["\"]) in "editing
> contexts".

That's not what NEWS says, and if developers don't agree on this, you
must expect the rest of us to be confused...

  ** Non-breaking space and hyphens are now prefixed with an escape
  character, unless the new user variable `show-nonbreak-escape' is set
  to nil.

> The crucial point is that the way these characters should be displayed
> differs depending on the context in which they are displayed -- in a
> "display context", like a Gnus *Article* buffer, or a help buffer,
> they can be displayed "natively" (e.g. a NBSP as a simple space), but
> in an "editing context", e.g. a normal emacs buffer, they should be
> displayed in a visually distinct manner.  The reason is that when
> editing, it's important that the user not confuse them with other more
> common characters.

I don't see what's so special about those two characters, especially
compared with spaces v. tabs.  Here's a small selection of likely
homoglyphs (at least to the extent that ­ and - are).  This is just
for ASCII (on the right) and ignores the multiple Emacs iso8859
charsets which have distinct versions of the same character:

  Α A
  А A
  ‚ ,
  ∖ \
  ‹ <
  – -
  − -
  ‐ -
  ‑ -
  ― -
  ─ -

The best way to recognize these, like in URL spoofing attacks, is to
highlight boundaries between characters from different scripts.  Those
can be found from category codes derived from the script table which
is in whatever the `unicode' branch is called now.  Unfortunately,
that doesn't work too well if you have to use only the current regexp
types for, say, font-locking; you have to look for \cX\CX for all X in
the script categories.  It's easier to spot characters not appropriate
in the current locale/language environment, when you don't need to
find boundaries; you probably don't want that in places like mail
headers, though.
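
To make the boundary idea concrete, here is a rough sketch in Python
rather than Emacs Lisp, since it is easy to run stand-alone.  It is
not how Emacs or the `unicode' branch does it: Python's standard
library has no script property, so the hypothetical helper
`rough_script' guesses the script from the first word of the Unicode
character name, which is only an approximation (it mislabels things
like the micro sign).  `script_boundaries' is likewise an illustrative
name, not an existing API.

```python
import unicodedata


def rough_script(ch):
    """Guess a letter's script from the first word of its Unicode name.

    Assumption: this stands in for real script categories, which the
    standard library does not provide.  Non-letters return None.
    """
    if not ch.isalpha():
        return None  # digits, punctuation, spaces: treated as neutral
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def script_boundaries(text):
    """Return each index i where text[i-1] and text[i] are adjacent
    letters whose guessed scripts differ."""
    boundaries = []
    prev = None
    for i, ch in enumerate(text):
        cur = rough_script(ch)
        if prev is not None and cur is not None and cur != prev:
            boundaries.append(i)
        prev = cur  # a neutral character resets the comparison
    return boundaries


if __name__ == "__main__":
    # "paypal" with U+0430 CYRILLIC SMALL LETTER A as the second letter.
    spoofed = "p\u0430ypal"
    print(script_boundaries(spoofed))
```

The indices it returns are the places a highlighter would mark.
Because neutral characters reset the comparison, only directly
adjacent letters are compared, which is roughly what the \cX\CX
regexp approach amounts to.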

Emacs is being quite inconsistent anyway.  Apart from special-casing
iso8859-1 for this new highlighting (as opposed to 8859-N), it
specifically displays eight-bit-graphic characters as graphics to
confuse you.  I nullify the display table to avoid that, but at least
you can figure out that it is the display table being set, unlike with
these new features.  It seems to me they're done in a very un-Emacsy
way that you can't easily find out about and can't customize.  I think
it is reasonable to have a minor mode to change the display table for
this sort of thing.  Displaying tabs temporarily using ␉, for
instance, is an obvious case.

> So far as I know, this is not a distinction that the unicode standard
> makes, but it is one that Emacs needs to worry about.

I don't understand the distinction you're making.  Unicode surely
doesn't expect a format character like U+00AD to disappear in a text
editor where you're editing the source.  (Yudit displays it with one
of those glyphs containing the unicode, and presumably there wouldn't
be an issue if the fonts omitted a glyph for the relevant codepoint.)

The real danger from confusing characters is with the display of
things like internationalized URLs in a `display' context, if you
will, though Emacs doesn't currently deal with them as far as I know.
(See bugtraq & al.)
