Re: "transparent" output and throughput, demystified
From: Deri
Subject: Re: "transparent" output and throughput, demystified
Date: Wed, 04 Sep 2024 17:03:09 +0100
On Sunday, 1 September 2024 06:09:17 BST G. Branden Robinson wrote:
> Hi Deri,
>
> At 2024-08-31T17:07:28+0100, Deri wrote:
> > On Saturday, 31 August 2024 00:07:57 BST G. Branden Robinson wrote:
> [fixing two of my own typos and one wordo in the following]
>
> > > It would be cleaner and simpler to provide a mechanism for
> > > processing a string directly, discarding escape sequences (like
> > > vertical motions or break points [with or without hyphenation]).
> > > This point is even more emphatic because of the heavy representation
> > > of special characters in known use cases. That is, to "sanitize"
> > > (or "pdfclean") such strings by round-tripping them through a
> > > process that converts a sequence of easily handled bytes like
> > > "\ [ ' a ]" or "\ [ u 0 4 1 1 ]" into a special character node and
> > > then back again seems wasteful and fragile to me.
> >
> > This would be great, but I see some problems with the current code.
> > Doing this:-
> >
> > [derij@pip build (master)]$ echo ".device \[u012F]"|./test-groff -Tpdf
-Z
> > |
> > grep "^x X"
> > x X \[u012F]
> > [derij@pip build (master)]$ echo "\X'\[u012F]'"|test-groff -Tpdf -Z |
grep
> > "^x X"
> > x X \[u0069_032]
> >
> > Shows that the \[u012F] has been decomposed (wrongly!) by \X.
>
> You're raising two issues:
>
> The decomposed Unicode sequence should be:
>
> u0069_0328
>
> not
>
> u0069_032
>
> I 100% agree that that's a bug--thank you for finding it. I'll fix it.
>
> But, is doing the decomposition wrong? I think it's intended.
>
> Here's what our documentation says.
>
> groff_char(7):
>
> Unicode code points can be composed as well; when they are, GNU
> troff requires NFD (Normalization Form D), where all Unicode glyphs
> are maximally decomposed. (Exception: precomposed characters in
> the Latin‐1 supplement described above are also accepted. Do not
> count on this exception remaining in a future GNU troff that
> accepts UTF‐8 input directly.) Thus, GNU troff accepts “caf\['e]”,
> “caf\[e aa]”, and “caf\[u0065_0301]”, as ways to input “café”.
> (Due to its legacy 8‐bit encoding compatibility, at present it also
> accepts “caf\[u00E9]” on ISO Latin‐1 systems.)
Exactly, it says it "can" be composed, not that it must be (this text was
added by you post-1.22.4); in fact most \[uxxxx] input to groff is not
composed (it comes from preconv). Troff then performs NFD conversion so
that the escape matches a named glyph in the font. Conceptually this is a
stream of named glyphs; there is a second stream of device control text,
which has nothing to do with fonts or glyph names. Device controls are
passed by .device (and friends). \[u0069_0328] is a named glyph in a font;
\[u012F] is a 7-bit ASCII representation (provided by preconv) of the
Unicode code point.
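To make the distinction concrete, the NFD/NFC round trip can be sketched
with Perl's core Unicode::Normalize module (an illustration only, not
groff's actual code path):

    use Unicode::Normalize qw(NFD NFC);

    my $nfd = NFD(chr 0x012F);    # "i" plus combining ogonek: U+0069 U+0328
    printf "%04X ", ord $_ for split //, $nfd;    # prints "0069 0328"
    printf "\n%04X\n", ord NFC($nfd);             # prints "012F", the precomposed character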
The groff_char(7) passage you quote is simply saying that input to groff
can be composed or not. How has that any bearing on how troff talks to its
drivers? If a user actually wants to use a composite character, it is
saying you can enter \[u0069_0328] or you can leave it to preconv to use
\[u012F]. Unfortunately, the way you intend to change groff, document text
will always use the single glyph (if available) and meta-data will always
use a composite glyph, so there is no real choice left for the user.
User-facing programs use NFD, since it makes it easier to sort and search
the glyph stream. Neither grops nor gropdf is "user facing": they are
generators of documents which require a viewer or printer to render them;
the only user-facing driver is possibly X11. There is a visible difference
between using NFD and using the actual Unicode text character when
specifying PDF bookmarks. The attached PDF has screenshots of the bookmark
panel, using \[u0069_0328] (NFD) and \[u012F] (NFC). The example using
\[u012F] is superior (in my opinion) because it uses the single glyph the
font designer intended for that character rather than combining two glyphs
that don't marry up well.
> Here are the matches in the source, excluding some false positives.
>
> $ git grep 012F
> contrib/rfc1345/rfc1345.tmac:.char \[i;] \[u012F] \" LATIN SMALL LETTER I WITH OGONEK
> font/devhtml/R.proto:u0069_0328 24 0 0x012F
> font/devlj4/generate/text.map:433 012F u0069_0328
> font/devutf8/R.proto:u0069_0328 24 0 0x012F
> src/libs/libgroff/uniuni.cpp:  { "012F", "20069_0328" },
> src/utils/afmtodit/afmtodit.tables:  "012F", "0069_0328",
> src/utils/afmtodit/afmtodit.tables:  "iogonek", "012F",
> src/utils/hpftodit/hpuni.cpp:  { "433", "012F", },  // Lowercase I Ogonek
>
> The file "uniuni.cpp" is what's of relevance here. It stores a large
> decomposition table that is directly derived from the Unicode
> Consortium's UnicodeData.txt file. (In fact, I just updated that file
> for Unicode 15.1.)
This has no bearing on whether it is sensible to use NFD to send text to
output drivers rather than the actual Unicode value of the character.
> > Whilst this might make sense for the text stream since afmtodit keys
> > the glyphs on the decomposed unicode.
>
> Having one canonical decomposition in GNU troff makes _lots_ of things
> easier, I'm sure.
>
> > I would love to know why we decompose,
>
> I don't know. Maybe Werner can speak to the issue: he introduced the
> "uniuni.cpp" file in 2003 and then, in 2005, the "make-uniuni" script
> for regenerating it.
>
> > since none of our fonts include combining diacritical mark glyphs so
> > neither grops nor gropdf have a chance to synthesise the glyphs from
> > the constituent parts if it is not present in the font!
>
> It seems like a good thing to hold onto for the misty future when we get
> TTF/OTF font support.
So, it does not make sense now, but might in the future. I would concede
the point if composited glyphs were as good as the single glyph provided
in the font, but the attached PDF shows this is not always true. Also, in
the TTF/OTF fonts I've examined, if a font contains combining diacritics
it also contains glyphs for all the base characters which can take a
diacritic, since these are just calls to subroutines with any necessary
repositioning. If you know of any fonts which include combining diacritics
but don't provide single glyphs with the base character and the diacritic
combined, please correct me.
> > Given that the purpose of \X is to pass meta-data to output drivers,
>
> ...which _should_ be able to handle NFD if they handle Unicode at all,
> right?
Of course, much better for any sort, which grops/gropdf do not do; and if
they did, they would of course change the given text to NFD prior to
sorting. As regards searching, it's a bit of a double-edged sword. For
example, suppose the word "ocksŮ" in a UTF-8 document is used as both a
text heading and a bookmark entry (think .SH "ocksŮ"). Preconv converts
the "Ů" to \[u016E], and troff then applies NFD to match a glyph name in
the U-TR font: \[u0055_030A]. When .device and .output used "copy in"
mode, the original Unicode code point \[u016E] was passed to the device;
but once the recent changes to \X ("new mode 3") are rolled out to the
other 7(?) commands which communicate text to the device drivers, they
receive \[u0055_030A] instead. If this composite code (4 bytes in UTF-16)
is used as the bookmark text, we have seen it can produce suboptimal
results in the bookmark pane, but it can also break searching in the PDF
viewer. Okular (a PDF viewer) has two search boxes: one for the text,
where entering "ocksŮ" will find the heading, and a second for the
bookmarks, where entering "ocksŮ" will fail to find the bookmark, since
the final character is in fact two characters. This result may surprise
users: entering exactly the same keystrokes as they used when writing the
document finds the text in the document, but fails to find the bookmark.
Then why does it work in the text search, you may ask, since both have
been passed an NFD composite code? The answer is that in the grout passed
to the driver it becomes "Cu0055_030A", and although this looks like
Unicode it is just the name of a glyph in the font, just as "Caq" in grout
will find the "quotesingle" glyph. The font header in the PDF identifies
the PostScript name of each glyph used for the document text, and the PDF
viewer has a lookup table which converts PostScript name "Uring" to U+016E
"Ů" (back where we started).
> > which probably will convert it to utf-8 or utf16, it seems odd to
> > decompose the output from preconv (utf16) before passing to the output
> > driver,
>
> It doesn't seem odd to me. The fact that Unicode has supported at least
> four different normalization forms since, as I recall, Unicode 3.0 (or
> earlier?) suggests to me that there's no one obvious answer to this
> problem of representation.
As I've shown, the NFD form used in grout (Cuxxxx_xxxx) is simply a key to
a font glyph; the information that this glyph is a composite is entirely
unnecessary for device control text. I need to know the Unicode code point
delivered by preconv, so that I can deliver that single character back as
UTF-16 text.
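Delivering it is trivial once the code point is known; for instance, with
Perl's core Encode module (PDF outline strings are big-endian UTF-16
preceded by a byte-order mark, per the PDF specification):

    use Encode qw(encode);

    # U+012F as a PDF outline string: byte-order mark plus UTF-16BE bytes.
    my $bookmark = encode('UTF-16BE', "\x{FEFF}" . chr 0x012F);
    printf "%02X ", ord $_ for split //, $bookmark;    # prints "FE FF 01 2F"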
> > .device does not.
>
> The reason is that the request reads its argument in copy mode, and `\X`
> does not.
>
> And, uh, well, you can plan on that going away. Or, at least, for the
> `device` request to align with whatever `\X` does.
>
> > The correct decompose for 012F is 0069_0328, so it is just a string
> > truncation bug.
>
> Yes, I'm happy to fix that.
>
> > Just like you I would like to avoid "round-tripping", utf16 (preconv)
> > -> decomposed (troff) -> utf16 (gropdf).
>
> That's not a good example of a round trip since there is no path back
> from "grout" (device-independent output) to a GNU troff node list or
> token sequence.
I used round-tripping in the general sense that, after processing, you end
up back where you started (the same sense in which you used it). Why does
groff have to be involved for something to be considered a round trip?
> > This does not currently affect grops which does not support anything
> > beyond 8bit ascii.
>
> I'll be tackling that soonish.
>
> https://savannah.gnu.org/bugs/?62830
>
> > Do you agree it makes more sense for \X to pass \[u012F] rather than
> > \[u0069_0328]?
>
> Not really. As far as I can tell there's no straightforward way to do
> anything different. GNU troff _unconditionally_ runs all simple
> (non-composite) special characters through the `decompose_unicode()`
> function defined in "uniuni.cpp".
OK, if it can't be done, just leave what you have changed in \X, but leave
.device and .output (plus friends) with the current copy-in mode, which
seems to be working fine as it is now, unless you have an example which
demonstrates a problem that your code solves. The only example you gave of
what you are "fixing", the .AUTHOR line in a mom example document,
actually works fine, so it is probably not a good example to justify your
changes.
> The place it does this is in `token::next()`, a deeply core function
> that handles pretty much every escape sequence in the language.
>
> https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n2295
>
> To do what you suggest would mean I'd have to add some kind of state to
> the formatter that alters how that function behaves, and I'd have to do
> it in six places, and make sure I unwound it correctly in each case (in
> error pathways as well as valid ones). Why six?
>
> Because all of
>
> \X
> .device
> \!
> .output
> .cf
> .trf
Why are two missing?
> can inject stuff into "grout".
>
> That seems like a perilous path to me.
Not if you restrict the changes to \X only, and document the difference in
behaviour from the other 7 methods.
> I appreciate that the alternative is to hand the output drivers a
> problem labeled "composite Unicode character sequences". I can try my
> hand at trying to write a patch for gropdf(1) if you like. It feels
> like it should be easier than doing so in C++ (which I'll also have to
> do).
It is not a problem; I can certainly embed a composite glyph as part of a
bookmark. The problem is that it does not always look very good (see the
attached PDF) and it messes up searching for bookmarks.
> At least if the problem is as straightforward as I think it is:
>
> Upon encountering a Unicode-style escape sequence, meaning a byte
> sequence starting `\[u`: [1]
>
> 0. Assert that the next character on the input stream is an uppercase
> hexadecimal digit.
>
> 1. Read a hexadecimal value until a non-hexadecimal character is found.
> Convert that value to whatever encoding the target device requires.
>
> 2. If the next character is `_`, go to 1.
>
> 3. If the next character is `]`, stop.
>
> Would you like me to give this a shot? A PDF reader expects UTF-16LE,
> right?
Have a go if you want; I've got it down to 10 extra lines, but the results
may be depressing (see the attached PDF).
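For the record, a sketch of that parse in Perl, building on the encoding
above (illustrative only; the function name is made up and this is not the
actual gropdf patch):

    use Encode qw(encode);

    # Convert one Unicode-style escape, e.g. "\[u0069_0328]", into a
    # PDF outline string, following the four steps quoted above.
    # Surrogate pairs for code points beyond U+FFFF are the encoder's job.
    sub unicode_escape_to_utf16 {
        my ($esc) = @_;
        return unless $esc =~ /^\\\[u([0-9A-F]{4,6}(?:_[0-9A-F]{4,6})*)\]$/;
        my $chars = join '', map { chr hex } split /_/, $1;
        return encode('UTF-16BE', "\x{FEFF}" . $chars);
    }

    # unicode_escape_to_utf16('\[u0069_0328]') yields FE FF 00 69 03 28.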
> Regards,
> Branden
>
> [1] The rules are a little more complicated for GNU troff itself due to
> support for the escape sequences `\[ul]`, `\[ua]`, and `\[uA]`. But
> as presently implemented, and per my intention, these will never
> appear in "grout"--only Unicode code point identifiers.[2]
>
> [2] And `\[ul]` won't appear even in disguise because it maps to no
> defined Unicode character. But you don't get a diagnostic about it
> because the formatter turns it into a drawing command.
>
> $ printf '\\[ul]\n' | ./build/test-groff -T pdf -ww -Z | grep '^D'
> DFd
> Dl 5000 0
Attachment: NCDvCopyIn.pdf (Adobe PDF document)
- Re: "transparent" output and throughput, demystified, G. Branden Robinson, 2024/09/01
- Re: "transparent" output and throughput, demystified,
Deri <=
- Re: "transparent" output and throughput, demystified, Dave Kemper, 2024/09/04
- Re: "transparent" output and throughput, demystified, G. Branden Robinson, 2024/09/04
- Re: "transparent" output and throughput, demystified, Deri, 2024/09/05
- Re: "transparent" output and throughput, demystified, G. Branden Robinson, 2024/09/06
- Re: "transparent" output and throughput, demystified, Dave Kemper, 2024/09/07