
From: Eli Zaretskii
Subject: Re: Broken `if big5-p` code in titdic-cnv.el (was: Scan of broken conditional forms)
Date: Wed, 27 Jan 2021 18:16:28 +0200

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Kenichi Handa <handa@m17n.org>, Eli Zaretskii <eliz@gnu.org>,
>   mattiase@acm.org,  emacs-devel@gnu.org
> Date: Tue, 26 Jan 2021 22:02:35 -0500
>
> So, I think using `iso-2022-jp` is a bad idea here: it gives the
> illusion that the two branches are different where they really aren't.
> If we do want to recover the difference (the one we presumably lost in
> Emacs-23), we need to make those two branches return
> properly-propertized strings with something like:
>
>     (defun tsang-quick-converter (dicbuf tsang-p big5-p)
>       (let* ((charset (if big5-p 'chinese-big5-1 'chinese-cns11643-1))
>              (fulltitle
>               (propertize (if tsang-p "倉頡" "簡易")
>                           'charset charset))
>
> Tho I'm not sure even that would be sufficient, since that function
> generates a file so if it just prints those strings into an Elisp file,
> the info would again be lost, at least when that Elisp file
> gets compiled.
>
> Given that we lived blissfully unaware of the problem for the last 10
> years (plus another year with some vague awareness of it but still
> without doing anything about it), I suggest we get rid of the `if
> big5-p` tests and switch the file to `utf-8`.

I discussed this with Handa-san a year ago, and we arrived at the
conclusion that the charset information is indeed no longer important.

However, if you look carefully at the part of tsang-quick-converter
that begins with

    (let ((punctuation '((";" ";﹔,、﹐﹑" ";﹔,、﹐﹑")

and ends with

    (dolist (elt punctuation)
      (insert (format "(%S %S)\n" (concat "z" (car elt))
                      (if big5-p (nth 1 elt) (nth 2 elt))))))

you will see that some of the characters in the punctuation structure
are actually different between the big5-p and non-big5-p branches,
although most of them are identical.  So either these are artifacts of
converting this file from its original encoding, or there are actual
differences between these two branches, and we cannot simply delete
one of them.
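
One way to see exactly which entries differ is to filter the list for
elements whose second and third strings are not equal.  What follows is
only a rough sketch, not code from titdic-cnv.el: the helper name
`my-punctuation-diffs' is hypothetical, and it assumes each element has
the shape (KEY BIG5-STRING NON-BIG5-STRING), as in the excerpt above.

```elisp
;; Sketch only: `my-punctuation-diffs' is a hypothetical helper and is
;; not part of titdic-cnv.el.  PUNCTUATION is assumed to be a list of
;; elements of the form (KEY BIG5-STRING NON-BIG5-STRING).
(defun my-punctuation-diffs (punctuation)
  "Return the elements of PUNCTUATION whose two candidate strings differ."
  (let (diffs)
    (dolist (elt punctuation)
      ;; Keep the element only when the big5-p and non-big5-p strings
      ;; are not character-for-character identical.
      (unless (string= (nth 1 elt) (nth 2 elt))
        (push elt diffs)))
    ;; `push' builds the list in reverse, so restore the original order.
    (nreverse diffs)))
```

Evaluating this on the real `punctuation' value would leave only the
entries that need a closer look; an empty result would mean the two
branches are in fact identical.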

This puzzle has been sitting in my TODO since I discovered these
differences a year ago.  If you (or someone else) are willing to
unlock the mystery and simplify the file accordingly, that would be
welcome indeed.
