bug-gnu-emacs
bug#20140: 24.4; M17n shaper output rejected


From: Richard Wordingham
Subject: bug#20140: 24.4; M17n shaper output rejected
Date: Sun, 13 Feb 2022 20:53:10 +0000

On Sun, 13 Feb 2022 18:04:11 +0200
Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Sat, 5 Feb 2022 22:52:51 +0000
> > From: Richard Wordingham <richard.wordingham@ntlworld.com>
> > Cc: Lars Ingebrigtsen <larsi@gnus.org>, 20140@debbugs.gnu.org
> > 
> > You're welcome to include my composition rules.  
> 
> Thanks.  I started with your code:
> 
> > (defvar tai-tham-composable-pattern
> >   (let ((table
> >          ;; C is letters, independent vowels, digits, punctuation and symbols.
> >          '(("C" . "[\u1A20-\u1A54\u1A80-\u1A89\u1A90-\u1A99\u1AA0-\u1AAD]")
> >            ("M" . "[\u1A55-\u1A57\u1A59-\u1A5E\u1A61-\u1A7C\u1A7F]") ; Mark
> >            ("H" . "\u1A60") ; sakot
> >            ("S" . "[\u1A75-\u1A7C]") ; Marks commuting with sakot
> >            ("N" . "\u1A58"))) ; mai kang lai
> >         (basic_syllable "C\\(N*\\(M\\|HS*C\\)\\)*")
> >         (regexp "X\\(N\\(X\\)?\\)*H?")) ; X is basic syllable
> >     (let ((case-fold-search nil))
> >       (setq regexp (replace-regexp-in-string "X" basic_syllable regexp t t))
> >       (dolist (elt table)
> >         (setq regexp (replace-regexp-in-string (car elt) (cdr elt)
> >                                                regexp t t))))
> >     regexp))
> >
> > (let ((elt (list (vector tai-tham-composable-pattern 0 'font-shape-gstring)
> >                  (vector "." 0 'font-shape-gstring))))
> >   (set-char-table-range composition-function-table '(#x1A20 . #x1AAD) elt))
> 
> But that didn't seem to work well enough: e.g., some marks in your
> "sample text" didn't combine with letters, as I think they should.

Which ones?  Are you sure they didn't combine at the Emacs level?
I did suspect the problem was writing '\u1A7C' instead of
'\u1a7c', but I'm no longer so sure.  (The 'C' might get expanded, but
I'm beginning to think not.)
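
One quick way to test the suspicion about '\u1A7C' is to evaluate a few
forms in *scratch*.  This is only a sketch, and it assumes the pattern
reaches Emacs as a string literal read by the Lisp reader, as in the
quoted defvar:

```elisp
;; The reader turns each \uXXXX escape into a single character, so the
;; resulting string holds no literal "C" for replace-regexp-in-string
;; to find, and the case of the hex digits makes no difference.
(length "\u1A7C")                    ; 1
(string= "\u1A7C" "\u1a7c")          ; t
(let ((case-fold-search nil))
  (string-match-p "C" "\u1A7C"))     ; nil
```

If those results hold, the letter case of the escapes is unlikely to be
the cause, and the failure to combine would have to come from elsewhere.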

> Then I tried this simplistic setting:
> 
>   (set-char-table-range composition-function-table
>                       '(#x1a20 . #x1aaf)
>                       (list (vector "[\u1a20-\u1aaf]+" 0
>                                     'font-shape-gstring)))
> 
> and it worked much better, including passing a small number of the
> tests from your renderer test page that I threw at Emacs.  This is on
> MS-Windows with Emacs 29 and HarfBuzz 2.4.0 (which is not even the
> latest release of HarfBuzz), and with the A Tai Tham KH New V3 font.

> Any reason not to use the above simple setup for Tai Tham text
> composition?

The main drawback is that you would have to edit the text with
"autocomposition at point disabled", or else mark word boundaries, e.g.
with U+200B ZERO WIDTH
SPACE. The Tai languages that use Tai Tham use scriptio continua.  While
modern Pali does separate words with visible white space, its words
tend to be polysyllabic; with discerning composition, it would be about
as tolerable as editing Hindi in Devanagari with autocomposition
enabled. (Quite a few people edit Devanagari in transliteration to
Latin!)

You should also add CGJ and ZWNJ, and some people may appreciate ZWJ -
the Khottabun font has ligatures involving ZWJ, though it may just be
an experimental feature - and ultimately WJ, for when someone writes a
Tai Tham word breaker. Oh, and Thai and Lao mai t(r)i and mai
chat(t)awa and U+0324 COMBINING DIAERESIS BELOW turn up occasionally -
U+0324 is supported in Thep's Khottabun font, and my Da Lekh series
supports Thai mai tri and mai chattawa. These characters seem to work
with HarfBuzz.
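
As a rough sketch of what adding those characters might look like, the
simple rule could be widened so that a cluster still starts on a Tai
Tham character but may continue over the extra ones.  The shape of the
regexp and the choice to register it only for the Tai Tham range are
assumptions for illustration; whether Emacs hands such mixed clusters to
the shaper unchanged has not been checked here.

```elisp
;; Sketch only: the regexp shape is an assumption, not a tested rule.
(let* ((tham "\u1a20-\u1aaf")
       (extras (concat "\u034f"        ; CGJ
                       "\u200c\u200d"  ; ZWNJ, ZWJ
                       "\u2060"        ; WJ
                       "\u0e4a\u0e4b"  ; Thai mai tri, mai chattawa
                       "\u0eca\u0ecb"  ; Lao mai ti, mai catawa
                       "\u0324"))      ; COMBINING DIAERESIS BELOW
       ;; First character must be Tai Tham; the rest may be either set.
       (pattern (concat "[" tham "][" tham extras "]*")))
  (set-char-table-range composition-function-table
                        '(#x1a20 . #x1aaf)
                        (list (vector pattern 0 'font-shape-gstring))))
```

Because the rule is keyed on the Tai Tham range only, a stray ZWJ or
Thai tone mark outside Tai Tham text would not start a cluster of its
own.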

If using the native Windows renderer is an option with Emacs, then 'A
Tai Tham KH New' works better than 'A Tai Tham KH New V3'.  I've
created https://wrdingham.co.uk/lanna/font_test.htm to do _font_
comparisons.  I'd delayed because I've only recently satisfied myself
that it is lawful, at least under English law.  (The qualms were
with the samples taken from books.)  It's still very much a work in
progress.

> I needed a couple more additions to Emacs to make Tai Tham support
> work OOTB: for example, script-representative-chars lacked an entry
> for Tai Tham, and the default fontset needed an addition.  (And on
> MS-Windows, one needs to run the w32-find-non-USB-fonts magic once, to
> notice the newly installed Tai Tham font.)
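
For anyone trying this before such support is built in, the additions
described above might be approximated from an init file roughly as
follows.  This is a sketch under assumptions: the representative
characters are an illustrative choice, the font family is simply the one
named earlier in this thread, and it presumes Emacs already knows
`tai-tham' as a script name.

```elisp
;; Hypothetical init-file settings; the representative characters are
;; an illustrative choice, not necessarily what Emacs ships.
(add-to-list 'script-representative-chars
             '(tai-tham #x1A20 #x1A55 #x1A61 #x1A80))

;; Tell the default fontset (t) which font to try for the script.
(set-fontset-font t 'tai-tham
                  (font-spec :family "A Tai Tham KH New V3"))

;; On MS-Windows only, run once after installing the font:
;;   M-x w32-find-non-USB-fonts
```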

> Other than that, assuming the above setting of
> composition-function-table is okay, we are ready to officially add Tai
> Tham to scripts supported by Emacs.

> Btw, is there a way to get all the examples from your
> https://wrdingham.co.uk/lanna/renderer_test.htm as a UTF-8 encoded
> text file?  I'd like to test the Emacs rendering with all of the
> examples, but copy-pasting each example separately from the browser is
> not my idea of useful time investment.  So if you could provide the
> examples as a downloadable text file, I'd appreciate it.

The answer is buried (you're not the only one to have overlooked it) in
the penultimate paragraph of the 'Content and Layout' section: "The test
words
may, in principle, be extracted quite simply from this web page. Each
test 'word' is the content of the first cell in each row whose class is
tst1. For convenience*, I have extracted the first two cells in such
rows, along with titles, to a CSV file."  The file is rt.csv in the
same directory.  I included the meaning and pronunciation as those who
don't know the script may find it easier to refer to the words by
translation or transcription.  You may prefer to use the file more or
less as it is, but one can easily knock up an Emacs macro sequence to
delete the first comma and the rest of the line.  I left the
section titles in for easier navigation to the renderer test file.
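
A minimal sketch of such a clean-up, as a command rather than a keyboard
macro, is given below.  The function name is made up for illustration,
and it assumes that no test word itself contains a comma and that lines
without a comma (such as the section titles) should be left alone.

```elisp
(defun my-rt-csv-keep-first-field ()
  "Delete the first comma and the rest of the line, on every line.
Hypothetical helper for rt.csv; lines without a comma are untouched."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    ;; ",.*$" matches from the first comma on a line to the line's end.
    (while (re-search-forward ",.*$" nil t)
      (replace-match "" t t))))
```

Visit rt.csv and run M-x my-rt-csv-keep-first-field; what remains on
each row is the test word alone.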

*Some people claim to find XML files easy to use; they should then be
able to analyse a file conforming to HTML4 syntax.

Dodgy spellings go in pink rows whose class is 'tst2'.  The alternative
encodings demanded by the USE go in orange rows whose class is 'tst3'.
I have not extracted these.

Richard.




