help-gnu-emacs

Re: How to translate LaTeX into UTF-8 in Elisp?


From: Héctor Lahoz
Subject: Re: How to translate LaTeX into UTF-8 in Elisp?
Date: Tue, 4 Jul 2017 12:23:48 +0200
User-agent: Mutt/1.5.20 (2009-06-14)

Marcin Borkowski wrote:
> OK, so here is a proof of concept:
> 
> --8<---------------cut here---------------start------------->8---
> (defvar TeX-to-Unicode-accents-alist
>   '((?` . "with grave")
>     (?' . "with acute")
>     (?^ . "with circumflex")
>     (?\" . "with diaeresis")
>     (?H . "with double acute")
>     (?~ . "with tilde")
>     (?c . "with cedilla")
>     (?k . "with ogonek")
>     (?= . "with macron")
>     (?. . "with dot above")
>     (?u . "with breve")
>     (?v . "with caron"))
>   "A mapping from TeX control characters to accent names used in
> Unicode.")
> 
> (defun combine-letter-diacritical-mark (letter mark)
>   "Return a Unicode string of LETTER combined with MARK.
> MARK can be any character that can be used in TeX accenting
> commands."
>   (let* ((letter (if (stringp letter)
>                      (string-to-char letter)
>                    letter))
>          (uppercase (= letter
>                        (upcase letter))))
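>     ;; `ucs-names' is a hash table in Emacs 26 and later, so look the
>     ;; name up with `char-from-name' (case-insensitively) instead of
>     ;; `assoc-string'.
>     (char-from-name
>      (format "LATIN %s LETTER %c %s"
>              (if uppercase "CAPITAL" "SMALL")
>              letter
>              (cdr (assoc mark TeX-to-Unicode-accents-alist)))
>      t)))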
> --8<---------------cut here---------------end--------------->8---
> 

Great.

Perhaps you could consider translating to Unicode combining characters
instead. I think that is closer to the original TeX idea and could be
cleaner:

0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM;;;;;N;NON-SPACING CIRCUMFLEX;;;;
0303;COMBINING TILDE;Mn;230;NSM;;;;;N;NON-SPACING TILDE;;;;
0304;COMBINING MACRON;Mn;230;NSM;;;;;N;NON-SPACING MACRON;;;;
0305;COMBINING OVERLINE;Mn;230;NSM;;;;;N;NON-SPACING OVERSCORE;;;;
0306;COMBINING BREVE;Mn;230;NSM;;;;;N;NON-SPACING BREVE;;;;
0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING DIAERESIS;;;;
0309;COMBINING HOOK ABOVE;Mn;230;NSM;;;;;N;NON-SPACING HOOK ABOVE;;;;
030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
030B;COMBINING DOUBLE ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING DOUBLE ACUTE;;;;
030C;COMBINING CARON;Mn;230;NSM;;;;;N;NON-SPACING HACEK;;;;
030D;COMBINING VERTICAL LINE ABOVE;Mn;230;NSM;;;;;N;NON-SPACING VERTICAL LINE ABOVE;;;;
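
A minimal sketch of that idea, assuming a hypothetical alist
`my-tex-accent-combining-alist' and a hypothetical helper
`my-tex-accent-to-combining' (neither name comes from the code quoted
above). It only covers TeX accents whose combining characters appear in
the list above, so \c and \k are left out:

```elisp
;; Hypothetical mapping from TeX accent control characters to the
;; code points of the corresponding Unicode combining characters.
(defvar my-tex-accent-combining-alist
  '((?`  . #x0300)   ; COMBINING GRAVE ACCENT
    (?'  . #x0301)   ; COMBINING ACUTE ACCENT
    (?^  . #x0302)   ; COMBINING CIRCUMFLEX ACCENT
    (?~  . #x0303)   ; COMBINING TILDE
    (?=  . #x0304)   ; COMBINING MACRON
    (?u  . #x0306)   ; COMBINING BREVE
    (?.  . #x0307)   ; COMBINING DOT ABOVE
    (?\" . #x0308)   ; COMBINING DIAERESIS
    (?H  . #x030B)   ; COMBINING DOUBLE ACUTE ACCENT
    (?v  . #x030C))  ; COMBINING CARON
  "Alist mapping TeX accent characters to combining characters.")

(defun my-tex-accent-to-combining (letter mark)
  "Return a string of LETTER followed by the combining character for MARK.
LETTER is a character or a one-character string.  Return nil if
MARK is not in `my-tex-accent-combining-alist'."
  (let ((base (if (stringp letter) (string-to-char letter) letter))
        (combining (cdr (assq mark my-tex-accent-combining-alist))))
    (when combining
      ;; Unicode order: base character first, then the combining mark.
      (string base combining))))
```

With this, (my-tex-accent-to-combining ?a ?') would give a two-character
string, "a" followed by U+0301, and no lookup in the table of Unicode
names is needed.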

See the Wikipedia article on Unicode equivalence:
https://en.wikipedia.org/wiki/Unicode_equivalence

The difference is that Unicode reverses the order: the base character
comes first, followed by all of its combining characters. For example,
\'a would be translated to either

00E1;LATIN SMALL LETTER A WITH ACUTE

or

0061;LATIN SMALL LETTER A
0301;COMBINING ACUTE ACCENT

I don't know the implications of using Unicode combining characters.
I guess the choice depends on the purpose of the output.
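
One implication that can be checked from Elisp: the two forms are
different strings, but they are canonically equivalent, so normalizing
either one should give the other. A small sketch, assuming the
`ucs-normalize' library that ships with Emacs; the helper name
`my-canonically-equivalent-p' is made up for illustration:

```elisp
(require 'ucs-normalize)

(defun my-canonically-equivalent-p (a b)
  "Return non-nil if strings A and B are equal after NFC normalization."
  (string= (ucs-normalize-NFC-string a)
           (ucs-normalize-NFC-string b)))

;; The two translations of \'a discussed above.
(defvar my-a-acute-precomposed (string #x00E1)
  "LATIN SMALL LETTER A WITH ACUTE as a one-character string.")
(defvar my-a-acute-decomposed (string ?a #x0301)
  "LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT.")

;; A plain comparison sees two different strings of different length ...
(string= my-a-acute-precomposed my-a-acute-decomposed)
;; ... while a comparison after normalization treats them as the same.
(my-canonically-equivalent-p my-a-acute-precomposed my-a-acute-decomposed)
```

So if the output is going to be searched or compared, it is probably
worth picking one form and normalizing to it, for instance with
`ucs-normalize-NFC-string' for the precomposed form.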
