Re: Request to support CJK fonts in grops
From: G. Branden Robinson
Subject: Re: Request to support CJK fonts in grops
Date: Fri, 15 Jul 2022 23:30:33 -0500
Hello Tanaka-san,
At 2022-07-02T19:39:44+0900, ttk@t-lab.opal.ne.jp wrote:
> Dear All,
>
> I wrote a set of patches in order to support CJK fonts in PostScript
> outputs.
>
> https://github.com/t-tk/groff-CJK-font
Thank you for doing this work!
> - It enables to support CJK non-embedded fonts in UTF16 encoding by
> grops.
> - It enables to define charset by a range of Unicode code points in
> font definition. It is useful for CJK fonts because the number of
> glyphs with the same metric are very large.
>
> The tested examples are shown in tests/* in the GitHub repository.
> They include typeset of Chinese, Japanese, Korean.
>
> I hope the maintainers to consider accepting my request.
I think your changes are well worth considering, but they will take
some effort to integrate with the current state of groff 1.23. There
have been literally thousands of commits since 1.22.4 was released.
https://git.savannah.gnu.org/cgit/groff.git/log/
I have looked over the commits at your GitHub site and I have some
comments on a few of them. Would you prefer to receive that feedback
there, or here on this mailing list? (If the latter, I'll quote the
diffs, of course.)
Let me highlight a few things that give me pause. This is meant as
constructive criticism and not as a suggestion that this work isn't
worth doing, or integrating into groff. Also, I want this much at least
to go to the list so that some of the other experts here can correct me
anywhere I need it. (We have several contributors with knowledge of
PostScript and PDF that is far superior to mine.)
A. I'm not sure about the change that makes the C1 controls valid input
code points.
https://github.com/t-tk/groff-CJK-font/commit/0f84463ecbe584cdbf91156819cd03e7d71d77f1
As I understand it, this shouldn't be necessary: this table is used
to screen out invalid _input_ code points. It is not used to forbid
output code points or glyph indices in fonts. This is an important
distinction because, in the 1-byte code space groff uses for input
processing, nearly all of the invalid input code points are used by
the troff program for internal purposes. That, in turn, is one of
the first things that needs to be changed if groff is ever to
natively accept UTF-8 input. (If I were doing this work, I'd
probably relocate groff's internal-usage code points to the Unicode
Private Use Area.)
For example, in the Times Roman font we use with grops, code point
decimal 130 encodes "guillemotright" as named in the Adobe Glyph
List (AGL). Decimal 130 is _not_ valid groff input; this same value
(octal 0202) is used to encode the transparent file request.
https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h#n46
You might try reverting this patch to see if anything breaks. If my
understanding is correct, and if your input is processed with
preconv (which, as we'll see in the next item, it is) to encode the
many CJK characters it will require, then all your tests should
still pass.
If I'm wrong, then I'd very much like to figure out what I don't
understand!
B. I like the broad test coverage but I'm a little doubtful of its
architecture on two levels.
https://github.com/t-tk/groff-CJK-font/commit/c3be2e29ae362283f442bc7ab99210a58425edbc
First, it is not integrated into the Automake test harness we
otherwise use. You could be forgiven for overlooking that in groff
1.22.4, as I think back then we had only a handful of such automated
tests--maybe 4. But now, we have well over 100. So I would prefer
to see the tests run under the Automake-managed test harness. I've
written most of those 100+ tests so I'm happy to help with the
migration.
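For orientation, hooking a test into the harness amounts to adding a shell script and listing it in the relevant Automake fragment. The variable name and path below follow the pattern groff uses elsewhere but are hypothetical, not copied from the tree:

```
# Hypothetical Makefile.am / *.am fragment; adjust to the real
# grops build fragment when migrating.
grops_TESTS = \
  src/devices/grops/tests/cjk-utf16-fonts-work.sh
TESTS += $(grops_TESTS)
```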
Secondly, I think the testing is targeted very broadly. It seems to
use raw groff input and keep entire PostScript files around as test
artifacts to compare against (that is, if I'm understanding what you
have, they are the "expected" output).
Moreover, a later commit adds PNG conversions of the PostScript
output. I don't see where the output of patched groff is tested for
conformance with these PNGs, but I think doing so would be
misguided; while visually inspectable output is of interest, it's
beside the point of your changes. Comparing PNG output is really a
test of whatever is rendering the PostScript fonts, and that's not
groff. It is furthermore sensitive to alteration or replacement of
the font files used. But as long as the fonts maintain the same
encoding (and if "UTF-16" is embedded in their file names, I reckon
this should be the case), such alterations to the rendering are not
groff's problem to solve.
On the other hand, placement of the glyphs on the page is very
important and solidly groff's responsibility!
What I am getting at is that these tests appear to skip over an
important level of transformation, and do not target with
specificity the scope of most of what your patches change. I am
referring to the transformation of device-independent GNU troff
output to PostScript.
Were I writing tests for this, I would run groff to produce
device-independent output from an input document that exercises the
code paths and Unicode code point ranges of interest (much as you
have already done), but run "groff -Z" to save that
device-independent output, and use _it_ as the test input. This
output format is documented and fairly stable (see the groff_out(5)
man page). Some knowledge of it will be helpful to troubleshoot any
test failures, but I emphasize that it is okay to produce it with
"groff -Z". I've written device-independent *roff output from
scratch but that's only because I'm a paranoid, crazy documentarian
who wants to rewrite the groff_out(5) man page some day soon. ;-)
With the PostScript expected output in hand, I would then not check
for an exact match, but simply for the PostScript commands that
direct the renderer/printer to place the relevant encoded glyph(s)
on the page. This to me would seem to be simply a matter of
verifying that the right PostScript command and font code points are
in place. The basic function could be tested with a groff document
that puts a single glyph on the page.
For instance, adapting your "dankeCJK.1", I might do this.
$ groff -K utf8 -Z > dankeCJK.grout
早
$ cat dankeCJK.grout
x T ps
x res 72000 1 1
x init
x F -
p1
XXX
V24000
H72000
n12000 0
x trailer
V792000
x stop
At about the place "XXX" shows up (this is groff Git HEAD output),
with CJK UTF-16 font support we would instead see commands to mount
a font (like "CSS") and write the glyph, I expect with the "C"
command (documented in groff_out(5)).
It seems to me that almost any other difference in "grout" or
PostScript produced by groff would be irrelevant. The scope of your
work is to get groff to recognize and use UTF-16-encoded CJK fonts,
correct? Checking the entire output for exact matches makes it
likely that unrelated changes in groff will cause spurious test
failures.
C. Of lesser importance, and easier to solve, are some reservations I
have about code style. I am dubious about indirecting the changed
data type for `ps_output` through the preprocessor, altering its
first argument from a `char` to an unsigned short. I would prefer
to have the function tell the truth about its argument types.
"CHAR" is misleading. Furthermore, my first inclination would be to
go ahead and use an entire machine register for this datum, since
Unicode is a 21-bit code in a 32-bit space. Only if performance
problems, reliably measured and verified, with grops arose would I
consider an optimization to a shorter type.
More broadly, to get groff more Unicode friendly in the future we're
going to need to widen its internal data type for character and
glyph code points anyway, as discussed in item (A) above. I'd
personally be perfectly happy to see grops become the first step in
such an improvement. I also get the feeling it's going to be more
manageable and less disruptive to change that data type everywhere
else first, _before_ we change the core (GNU troff, the program,
itself).
Another small style issue is that I don't see any reason to hide
this new feature behind C preprocessor conditionals, _except_ for
the fact that you changed the syntax of the font description file.
https://github.com/t-tk/groff-CJK-font/commit/a34710636508177025ba1c1b9ec979ec191aae53
I think I would prefer to add a new font description file directive,
something like "charset-range", and use that to exercise your new
feature of applying a single set of metrics to a large number of
code points. This would better maintain continuity with a file
format and syntax that dates back to Kernighan's device-independent
troff in 1981. I say "continuity" not "compatibility", because a
non-groff *roff that tries to parse a groff font description file
using your new feature would fail to correctly process either form,
but by using a new keyword we make it much more obvious and clear to
anyone troubleshooting problems what has gone wrong--an incompatible
extension is in use. But every troff veteran who encounters groff
is accustomed to those, so it should be no great shock.
(On reflection, maybe "ranged-charset" would be a better name,
because it would avoid misinterpretation by lazy parsers that only
compare the first 7 bytes of the keyword--but maybe the diversity of
*roff font description file readers is so low that this doesn't
really matter.)
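To make the proposal concrete, a font description using such a directive might look something like the fragment below. The keyword, the range notation, and the metrics shown are all invented for illustration; only the general shape of a groff font description file (a "charset" section listing glyph name, metrics, type, and code) is real.

```
name    MJ
spacewidth 10000
charset
\[u3042]          10000,8800,1200  0  0x3042
charset-range
\[u4E00]..\[u9FFF] 10000,8800,1200 0  0x4E00
```

A non-groff *roff reader would reject the unknown "charset-range" keyword outright, which is exactly the loud, diagnosable failure mode argued for above.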
It has taken some time to consider these issues and compose this email;
I apologize for making you wait for the feedback.
Can you tell me where I might obtain the font files you are using? I am
eager to start experimenting and planning the least disruptive route to
integration.
> Thanks,
> Takuji TANAKA
Best regards,
Branden