
Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?

From: Eli Zaretskii
Subject: Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
Date: Sun, 06 Feb 2022 10:56:47 +0200

> From: Richard Stallman <rms@gnu.org>
> Cc: psainty@orcon.net.nz, luangruo@yahoo.com,
>       kevin.legouguec@gmail.com, emacs-devel@gnu.org
> Date: Sat, 05 Feb 2022 23:13:37 -0500
>   > I don't understand the specification of these functions.  How would
>   > diacriticize decide/know that ?~ is equivalent to the ?̃ (U+0303
>   > COMBINING TILDE) that is part of ?ã ?
> You know more about Unicode than I do, so I'm sure it is true _in some
> sense_ that "U+0303 (COMBINING TILDE) is part of ?ã".
> But I have doubts that that particular sense is the one that is
> pertinent to the job `diacriticize' is meant to do.
> I think you mean that one can represent the glyph image `ã' in Unicode
> as a composition using a sequence of `a' and COMBINING TILDE.  Please
> tell me if I am mistaken.

You are not mistaken.  The character 'ã' can be "decomposed" into 2
characters, 'a' and COMBINING TILDE.  This is called "canonical
decomposition" in Unicode.
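For example (a quick sketch, evaluated in Emacs with the bundled
ucs-normalize.el loaded), canonical decomposition is what
ucs-normalize-NFD-string performs; 'append' turns the result into a
list of character codes so the two characters are visible:

  (require 'ucs-normalize)
  ;; NFD splits U+00E3 into ?a (97) and COMBINING TILDE (771):
  (append (ucs-normalize-NFD-string "ã") nil) => (97 771)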

> The ã in this sentence is not a composition.  It is a single
> Unicode character, which is also in Latin-1.  I don't think that
> COMBINING TILDE is "part of it".

It is, in the sense that the original character can be decomposed.

>                                                  But how do you propose
>     to make the leap from ?̃ to ?~ ?
> (defconst unicode-combining-chars-alist '(... (?~ . ?̃ ) ...))

So you mean we should create a database of ASCII characters that
approximate the combining diacriticals?  But if so, how is it better
than having a database of complete characters and their ASCII
equivalents, like we have now in latin1-disp.el?  Your proposal may
make the database smaller (and even that mostly only for Latin
characters), but a database of complete characters makes it easier to
make sure the results are optimal, because you see the original
complete character and the complete equivalent, instead of "composing"
them in your head for all the combinations.

I think reasonable appearance is more important than memory
consumption in this case, and other than that, your proposal just
means replacing one database by another, right?

> However, `ucs-normalize-NFD-string' does not know anything about
> ligatures.  Given the ﬁ ligature, it returns the ﬁ ligature.

You need a different kind of decomposition for that, called
"compatibility decomposition":

  (ucs-normalize-NFKD-string "ﬁ") => "fi"

You can use ucs-normalize-NFKD-string for the job of
ucs-normalize-NFD-string as well:

  (append (ucs-normalize-NFKD-string "ã") nil) => (97 771)

(I used 'append' here to make it evident that the result of the
decomposition is 2 characters, not one, since the Emacs display will
by default combine them into the same glyph as the original non-ASCII
character, and an innocent reader could think the decomposition didn't
do anything.)

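Another small check along the same lines: comparing string lengths
shows that the decomposed string really differs from the precomposed
one, even though the two display identically:

  (length "ã")                                 => 1
  (length (ucs-normalize-NFD-string "ã"))      => 2
  (string= "ã" (ucs-normalize-NFD-string "ã")) => nil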