emacs-devel



From: Eli Zaretskii
Subject: Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
Date: Wed, 19 Jan 2022 10:20:07 +0200

> From: Richard Stallman <rms@gnu.org>
> Date: Tue, 18 Jan 2022 23:15:59 -0500
> 
>    Unicode allows user tracking by means of invisible text marking. Any
>    string can be converted into its binary form and then recoded into a
>    string of zero-width characters, which can then be invisibly inserted
>    into the text. If the text is posted elsewhere, the zero-width
>    character string can be extracted and the process reversed to figure
>    out the identity of the person who copied it.
> 
> which seems to be about a special case of confusables, and it makes me
> wonder whether Emacs does, or could, show users when Unicode confusion
> occurs, or prevent or fix it somehow.

AFAIU, there's no confusion here, "just" injection of hidden
information into plain text.  "Confusion" is when the user is
presented with some text that looks like something else.  Here the
problematic part is not presented at all.

> First, is that issue of invisible characters real?

Yes.  The idea is to use 2 "normal" characters to serve as binary zero
and binary one, which would then allow you to inject hidden text by
combinations of these two.  Of course, the technique is very
inefficient and will need many such characters to inject any
meaningful text.
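
To make the inefficiency concrete, here is a minimal sketch of the scheme in Python.  It assumes ZWNJ stands for binary zero and ZWJ for binary one; that mapping and the names hide and reveal are illustrative choices, not taken from any real watermarking tool.

```python
# Assumed mapping: which character means 0 and which means 1 is arbitrary.
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, used here as bit 0
ZWJ = "\u200d"   # ZERO WIDTH JOINER, used here as bit 1


def hide(payload: str) -> str:
    """Encode PAYLOAD as a string made only of ZWNJ/ZWJ characters."""
    bits = "".join(f"{byte:08b}" for byte in payload.encode("utf-8"))
    return "".join(ZWJ if bit == "1" else ZWNJ for bit in bits)


def reveal(text: str) -> str:
    """Collect the ZWNJ/ZWJ characters in TEXT and decode them as bytes."""
    bits = "".join("1" if ch == ZWJ else "0"
                   for ch in text if ch in (ZWJ, ZWNJ))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


# A 4-byte identifier costs 32 zero-width characters in the carrier text.
marked = "Hello," + hide("id42") + " world"
```

Each byte of the payload needs eight zero-width characters, so even a short identifier produces a long run of them in the carrier text.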

> Second, does Emacs do anything now such that these tricks
> won't succeed?

Emacs by default displays ZWJ and ZWNJ characters (and any other
zero-width characters) as thin 1-pixel spaces on GUI frames, and as
simple spaces on TTY frames.  So Emacs users are likely to see these
"hidden" sequences of characters on display.

> If the problem exists in Emacs now, could we prevent it?  I see a few
> ways to try.  I don't know whether they would work well.
> 
> * Indicate the different encodings on the screen somehow.
> 
> * Canonicalize such sequences (perhaps when reading text into Emacs),
> so that different encodings of the same text become identical.
> 
> * Use a stand-alone canonicalizer program.

I don't think I understand your proposals.  They seem to be based on
some idea that these characters are "encodings" of something, and that
this encoding can be "canonicalized"?  If so, I think this
interpretation is a mistake: there's no encoding going on here.  These
zero-width characters' role is to help the text-shaping engine to
shape the characters around them correctly, according to the rules of
the script of those surrounding characters.  When those zero-width
characters are used for the purpose of hiding text, they appear as
sequences of zero-width characters without any reason, and in
particular the characters that surround them are likely to be
whitespace characters, which don't need any joiners to shape them.
The job of a feature that detects this is to discern between these two
use cases, and flag the suspicious one.

In any case, I don't think these solutions could work by examining
single characters.  ZWJ and ZWNJ are important characters in some
scripts, so we cannot mangle them based on considering isolated
characters.  We must consider sequences of such characters when we
design a feature that makes them stand out, because only at that level
can we distinguish between legitimate uses of those characters and
suspicious uses.
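
A rough sketch of such a sequence-level check, in Python rather than Emacs Lisp, might look as follows.  The set of zero-width characters, the run-length threshold, and the name suspicious_runs are all assumptions made for illustration; this is not what textsec.el does.

```python
import re

# Assumed, non-exhaustive set: ZWSP, ZWNJ, ZWJ, WORD JOINER, ZWNBSP/BOM.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"
RUN = re.compile(f"[{ZERO_WIDTH}]+")


def suspicious_runs(text, min_run=3):
    """Return (start, end) spans of zero-width runs that look unjustified.

    A run is flagged if it is at least MIN_RUN characters long, or if it
    touches whitespace (the start and end of TEXT count as whitespace),
    since whitespace needs no joiners to be shaped correctly.
    """
    spans = []
    for match in RUN.finditer(text):
        start, end = match.span()
        before = text[start - 1] if start > 0 else " "
        after = text[end] if end < len(text) else " "
        long_run = end - start >= min_run
        near_space = before.isspace() or after.isspace()
        if long_run or near_space:
            spans.append((start, end))
    return spans
```

With these assumptions, a single ZWNJ between two Persian letters is left alone, while a run of joiners sitting next to a space is reported.  A real feature would need script-aware rules, since some scripts legitimately use short sequences of these characters.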

I think we should introduce a minor mode that detects those sequences
and makes them stand out on display, with or without some warning
message in the echo-area.  People who want to be aware of any such
potentially hidden text will turn that on.  We could also turn it on
automatically in email and eww.  Patches are welcome; I believe we
already have the infrastructure in the new textsec.el package.


