
Re: "transparent" output and throughput, demystified


From: G. Branden Robinson
Subject: Re: "transparent" output and throughput, demystified
Date: Wed, 4 Sep 2024 22:15:44 -0500

[fair warning: _gigantic_ message, 5.7k words]

Hi Deri & Dave,

I'll quote Dave first since his message was brief and permits me to make
a concession early.

At 2024-09-04T15:05:38-0500, Dave Kemper wrote:
> On Wed, Sep 4, 2024 at 11:04 AM Deri <deri@chuzzlewit.myzen.co.uk>
> wrote:
> > The example using \[u012F] is superior (in my opinion) because it is
> > using a single glyph the font designer intended for that character
> > rather than combining two glyphs that don't marry up too well.
> 
> I agree with this opinion.

Me too.  I can't deny that the pre-composed Ů looks much better than the
constructed one.

> > If you know of any fonts which include combining diacritics but
> > don't provide single glyphs with the base character and the
> > diacritic combined, please correct me.
> 
> My go-to example here is the satirical umlaut over the n in the
> canonical rendering of the band name Spinal Tap.  Combining diacritics
> can form glyphs that no natural language uses, so no font will supply
> a precomposed form.

It does happen, and since groff is a typesetting application, I think we
_can_ expect
people to try such things.  Ugly rendering is better than no rendering
at all, and it's not our job to make Okular render complex characters
prettily in its navigation pane.

That said, we _can_ throw it a bone, and it seems easy enough to do so.

> > This result may surprise users, that entering exactly the same
> > keystrokes as they used when writing the document, finds the text in
> > the document, but fails to find the bookmark.
> 
> I also agree this is less than ideal.

I'm not sure this is our fault either.  Shouldn't the CMap be getting
applied to navigation pane text just as to document text?  I confess
that it did not occur to me that one would _want_ to do a full-text
search on the navigation pane contents themselves.  When I search a PDF,
I want to search _the document_.

But that may simply be a failure of my imagination.

At 2024-09-04T17:03:09+0100, Deri wrote:
> On Sunday, 1 September 2024 06:09:17 BST G. Branden Robinson wrote:
> > But, is doing the decomposition wrong?  I think it's intended.
> > 
> > Here's what our documentation says.
> > 
> > groff_char(7):
> > 
> >      Unicode code points can be composed as well; when they are, GNU
> >      troff requires NFD (Normalization Form D), where all Unicode
> >      glyphs are maximally decomposed.  (Exception: precomposed
> >      characters in the Latin‐1 supplement described above are also
> >      accepted.  Do not count on this exception remaining in a future
> >      GNU troff that accepts UTF‐8 input directly.)  Thus, GNU troff
> >      accepts “caf\['e]”, “caf\[e aa]”, and “caf\[u0065_0301]”, as
> >      ways to input “café”.  (Due to its legacy 8‐bit encoding
> >      compatibility, at present it also accepts “caf\[u00E9]” on ISO
> >      Latin‐1 systems.)
> 
> Exactly, it says it "can" be composed, not that it must be (this text
> was added by you post 1.22.4),

...to groff_char(7), yes.  Those lunatics who aren't allergic to GNU
Texinfo would find it familiar from long ago.

commit 3df65a650247b1dd872b7afd4706ebbbfdd93982
Author:     Werner LEMBERG <wl@gnu.org>
AuthorDate: Sun Mar 2 10:10:17 2003 +0000
Commit:     Werner LEMBERG <wl@gnu.org>
CommitDate: Sun Mar 2 10:10:17 2003 +0000

    Document composite glyphs and the `composite' request.

    * man/groff.man, man/groff_diff.man, doc/groff.texinfo: Do it.
[...]
+For simplicity, all Unicode characters which are composites must be
+decomposed maximally (this is normalization form@tie{}D in the Unicode
+standard); for example, @code{u00CA_0301} is not a valid glyph name
+since U+00CA (@sc{latin capital letter e with circumflex}) can be
+further decomposed into U+0045 (@sc{latin capital letter e}) and U+0302
+(@sc{combining circumflex accent}).  @code{u0045_0302_0301} is thus the
+glyph name for U+1EBE, @sc{latin capital letter e with circumflex and
+acute}.
[...]

That's a good 14 years before I wandered in and ruined all our docs.

> in fact most \[uxxxx] input to groff is not composed (comes from
> preconv).

Yes.  Werner wrote preconv, too, so I reckon he made it produce what he
designed GNU troff to consume.[1]

> Troff then performs NFD conversion (so that it matches a named glyph
> in the font).

Fair.  I can concede that the primary purpose of NFD decomposition in
groff is to facilitate straightforward and unambiguous glyph lookup.
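
That lookup convention is easy to reproduce.  Here's a hedged Python
sketch (an illustration only, not groff's actual C++ code) that derives
the glyph name GNU troff would look up, by decomposing maximally (NFD):

```python
import unicodedata

def groff_glyph_name(ch):
    # Decompose maximally (Unicode NFD), then join the code points into
    # a "uXXXX[_XXXX...]" name of the kind groff_char(7) describes.
    nfd = unicodedata.normalize("NFD", ch)
    return "u" + "_".join("%04X" % ord(c) for c in nfd)

print(groff_glyph_name("\u00E9"))  # u0065_0301 (e + combining acute)
print(groff_glyph_name("\u1EBE"))  # u0045_0302_0301, as in Werner's example
```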

> Conceptually this is a stream of named glyphs, there is a second
> stream of device control text,

My conceptualization doesn't seem to quite match yours.  I think of a
grout document as consisting of _one_ stream, the sequence of bytes from
its start to its finish.  There are however multiple interpretation
contexts.  Special character names are one.  't' (and 'u') command
arguments are another (no special characters allowed)...

$ echo "hello, world" | groff -T ascii -Z \
  | sed 's/world/\\[uHAHAHAHA CHECK THIS OUT]/' | grep '^t'
thello,
t\[uHAHAHAHA CHECK THIS OUT]

> and this has nothing to do with fonts or glyph names.

Here I simply must push back.  This depends entirely on what the device
extension does, and we have _no control_ over that.  Excessive
presumptions about such open-ended language structures get us into trouble,
as with the question of whether setting the line drawing thickness in a
'\D' escape sequence should move the drawing position.  Coming at the
question ab initio, there's no reason to suppose that it should.  I will
boldly assert that one negative precedent was a blunder of Kernighan's,
assuming that all future drawing commands would exclusively comprise
sequences of coordinate pairs reflecting page motions.

Some geometric objects aren't usefully parameterized that way, as
Kernighan should have realized from his own '\D'c radius' command.
Apart from line thickness, possibilities like configuration of broken
line rendering (a nearly limitless variety of dotted, dashed,
dash-dotted, solid, and, maybe, invisible) should have been obvious even
at the time.

Opinions?  I got 'em.  Anyway, I think it's a bad idea to assume that a
device extension will never have anything to do with fonts or glyph
names.  In fact, GNU troff already assumes that they might, and has done
for over 20 years.

1a153a5268 src/roff/troff/node.cc  (Werner LEMBERG      2002-10-02 17:06:46 +0000  880) void troff_output_file::start_special(tfont *tf, color *gcol, color *fcol,
1a153a5268 src/roff/troff/node.cc  (Werner LEMBERG      2002-10-02 17:06:46 +0000  881)                                       int no_init_string)
037ff7dfcf src/roff/troff/node.cc  (Werner LEMBERG      2001-01-17 14:17:26 +0000  882) {
6f6302b0af src/roff/troff/node.cc  (Werner LEMBERG      2002-10-26 12:26:12 +0000  883)   set_font(tf);
1a153a5268 src/roff/troff/node.cc  (Werner LEMBERG      2002-10-02 17:06:46 +0000  884)   glyph_color(gcol);
1a153a5268 src/roff/troff/node.cc  (Werner LEMBERG      2002-10-02 17:06:46 +0000  885)   fill_color(fcol);
6f6302b0af src/roff/troff/node.cc  (Werner LEMBERG      2002-10-26 12:26:12 +0000  886)   flush_tbuf();
6f6302b0af src/roff/troff/node.cc  (Werner LEMBERG      2002-10-26 12:26:12 +0000  887)   do_motion();
7ae95d63be src/roff/troff/node.cc  (Werner LEMBERG      2001-04-06 13:03:18 +0000  888)   if (!no_init_string)
7ae95d63be src/roff/troff/node.cc  (Werner LEMBERG      2001-04-06 13:03:18 +0000  889)     put("x X ");
037ff7dfcf src/roff/troff/node.cc  (Werner LEMBERG      2001-01-17 14:17:26 +0000  890) }

A fortiori, the formatter seems to assume that a "special" (device
extension command) will dirty everything about the drawing context that
can possibly be made dirty.  This is causing me considerable grief, as
you've seen in <https://savannah.gnu.org/bugs/?64484>.

> Device controls are passed by .device (and friends).

You'd think, wouldn't you?  And I'd love to endorse that viewpoint.

But I can't, and your own preferences are erecting a barrier to my doing
so.  I'll come back to that with an illustration below.

> \[u0069_0328] is a named glyph in a font, \[u012F] is a 7bit ascii
> representation (provided by preconv) of the unicode code point.

Okay, a couple of things.  First: 0x12F does not fit in 7 bits, nor even
in 8.

It's precomposed.  It's enough to say that.

Second, \[u0069_0328] is not _solely_ a named glyph in a font.  We use
it that way, yes, and for a good reason as far as I can tell (noted
above), but it is a _general syntax for combining Unicode characters in
the groff language_.  Not only can you use it to express "base
characters" combined with one or more characters to which Unicode
assigns "combining" (non-spacing[2]) semantics, but you can also use it
to express ligatures.  And GNU troff does.  And has, since long before I
got here.

groff_char(7) again:

   Ligatures and digraphs
       Output   Input   Unicode           Notes
       ──────────────────────────────────────────────────────────────────
       ff       \[ff]   u0066_0066        ff ligature +
       fi       \[fi]   u0066_0069        fi ligature +
       fl       \[fl]   u0066_006C        fl ligature +
       ffi      \[Fi]   u0066_0066_0069   ffi ligature +
       ffl      \[Fl]   u0066_0066_006C   ffl ligature +
       Æ        \[AE]   u00C6             AE ligature
       æ        \[ae]   u00E6             ae ligature
       Œ        \[OE]   u0152             OE ligature
       œ        \[oe]   u0153             oe ligature
       IJ        \[IJ]   u0132             IJ digraph
       ij        \[ij]   u0133             ij digraph

How one decomposes such a composite character depends.  Ligatures should
be, and are, broken up and written one-by-one as their constituents, all
base characters.  Accented characters may have to be degraded to the
base character alone; one would certainly not serialize them like a
ligature.  How nai¨ve!  ;-)
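
That two-pronged policy can be sketched in a few lines of Python (a
hypothetical illustration of the rule just stated, not code from groff
or gropdf):

```python
import unicodedata

def degrade(glyph_name):
    # Fallback when no composite glyph is available: if any component is
    # a combining character, keep only the base character(s); otherwise
    # (a ligature or digraph) write out all the constituents one by one.
    chars = [chr(int(cp, 16)) for cp in glyph_name[1:].split("_")]
    if any(unicodedata.combining(c) for c in chars):
        return "".join(c for c in chars if not unicodedata.combining(c))
    return "".join(chars)

print(degrade("u0066_0066_0069"))  # ffi: ligature split into constituents
print(degrade("u0055_030A"))       # U: accent dropped, base kept
```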

> The groff_char(7) you quote is simply saying that input to groff can
> be composited or not.

I don't know about "simply", but yes.

> How has that any bearing on how troff talks to its drivers.

Of itself, it doesn't.  But because the output language, which I call
"grout", affords extension in certain ways, including a general purpose
escape hatch for device capabilities lacking abstraction in the
formatter, it _can_ come up.

As in two of the precise situations that lifted the lid on this infernal
cauldron: the annotation and rendering _outside of a document's text_
of section headings, and document metadata naming authors, who might
foolishly choose to be born to parents that don't feel bound by the
ASCII character set, and as such can appear spattered with diacritics in
an "info" dialog.

If GNU troff is to have any influence over how such things appear, we
must consider the problem of how to express text there, and preferably
do so in ways that aren't painful for document authors to use.

> If a user actually wants to use a composite character this is saying
> you can enter \[u0069_0328] or you can leave it to preconv to use
> \[u012F]. Unfortunately the way you intend to change groff, document
> text will always use the single glyph (if available)

Eh what?  Where is this implied by anything I've committed or proposed?
(It may not end up mattering given the point I'm conceding.)

> and meta-data will always use a composite glyph.

Strictly, it will always use whatever I get back from certain "libgroff"
functions.  But I'm willing to flex on that.  Your "Se ocksŮ"
example is persuasive.

Though some irritated Swede is bound to knock us about like tenpins if
we keep deliberately misspelling "också" like that.

> So there is no real choice for the user.

Okay, how about a more pass-through approach when it comes to byte
sequences of the form `\[uxxxx]` (where 'xxxx' is 4 to 6 uppercase
hexadecimal digits)?

I will have to stop using `valid_unicode_code_sequence()` from libgroff.
But that can be done.  And I need multiple validators regardless (or
flags to a common one), as there's no sensible way to handle code points
above U+00FF in file names, shell commands, or terminal messages,
because they all consist of C `const char *` strings (that moreover will
require transformation to C language character escapes--I hope only the
octal sort, though).  For more on this, see my conversation with Dave in
<https://savannah.gnu.org/bugs/?65108>.
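
For concreteness, here is a hypothetical Python sketch of the sort of
check I have in mind (the real validator would live in C++ in libgroff,
and its exact rules may differ):

```python
import re

# 4 hex digits, or 5 with no leading zero, or 6 beginning "10" (the
# Unicode range tops out at U+10FFFF); uppercase digits only.
_CODE_POINT = re.compile(
    r"\Au(?:[0-9A-F]{4}|[1-9A-F][0-9A-F]{4}|10[0-9A-F]{4})\Z")

def valid_code_point_name(name):
    if not _CODE_POINT.match(name):
        return False
    # Reject UTF-16 surrogates, which are not valid code points.
    return not (0xD800 <= int(name[1:], 16) <= 0xDFFF)

print(valid_code_point_name("u012F"))    # True
print(valid_code_point_name("u10FFFF"))  # True
print(valid_code_point_name("u012f"))    # False: lowercase digits
print(valid_code_point_name("uD800"))    # False: surrogate
```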

> User facing programs use NFD, since it makes it easier to sort and
> search the glyph stream. Neither grops nor gropdf are "user facing",
> they are generators of documents which require a viewer or printer to
> render them, the only user facing driver is possibly X11. There is a
> visible difference between using NFD and using the actual unicode text
> character when specifying pdf bookmarks. The attached PDF has
> screenshots of the bookmark panel, using \[u0069_0328] NFD and
> \[u012F] NFC. The example using \[u012F] is superior (in my opinion)
> because it is using a single glyph the font designer intended for that
> character rather than combining two glyphs that don't marry up too
> well.

Setting aside the term "user-facing programs", which you and I might
define differently, I find the above argument sound.  (Well, I'm a
_little_ puzzled by how precomposed characters are so valuable for
searching bookmarks since the PDF standard already had the CMap facility
lying right there.)

> This has no bearing on whether it is sensible to use NFD to send text
> to output drivers rather than the actual unicode value of the
> character.

That's vaguely worded.  I assume you mean "text in device extension
commands" here.  If so, conceded.

> > It seems like a good thing to hold onto for the misty future when we
> > get TTF/OTF font support.
> 
> So, it does not make sense now, but might in the future.

This isn't a makeweight argument.  We know such font _formats_ exist,
regardless of the repertoires that their specimens have conventionally
supported to date.  I think we'd be wise not to nail this door shut,
even if we don't walk through it today.

> I would concede here if composited glyphs were as good as the single
> glyph provided in the font, but the PDF attached shows this is not
> always true. Also, from TTF/OTF fonts I've examined, if the font
> contains combining diacritics it also contains glyphs for all the base
> characters which can use a diacritic, since it is just calls to
> subroutines with any necessary repositioning.  If you know of any
> fonts which include combining diacritics but don't provide single
> glyphs with the base character and the diacritic combined, please
> correct me.

I know of none, and I am confident your experience in font perusal and
evaluation is vastly broader than mine.

> > > Given that the purpose of \X is to pass meta-data to output
> > > drivers,

I agree with this earlier statement of yours, but I want to seize on it.
Here's why.  This is going to take a while.

commit f2a92911c552c3995c010f8beb9b89de3612e95a
Author: Deri James <deri@chuzzlewit.myzen.co.uk>
Date:   Thu Mar 1 15:16:11 2018 +0000

    Add page transitions to pdfs created with gropdf.
    
    * src/devices/gropdf.pl: Handle new '\X' commands to support
    page transitions in presentation mode pdfs. These commands are a
    subset of the commands used in present.tmac allowing slideshows
    to be directly produced from -Tpdf without using postscript ->
    gpresent.pl -> ghostscript.
    
    * tmac/pdf.tmac: New macros '.pdfpause' and '.pdftransition' to
    support page transitions.
    
    * src/devices/gropdf.1.man: Document the '\X' commands
    supported.

diff --git a/tmac/pdf.tmac b/tmac/pdf.tmac
index 4a002c37c..350f78391 100644
--- a/tmac/pdf.tmac
+++ b/tmac/pdf.tmac
@@ -18,7 +18,7 @@
[...]
@@ -799,6 +799,12 @@ .de pdfpagename
 .de pdfswitchtopage
 .nop \!x X pdf: switchtopage \\$*
 ..
+.de pdfpause
+.nop \!x X ps: exec %%%%PAUSE
+..
+.de pdftransition
+.nop \!x X pdf: transition \\$1 \\$2 \\$3 \\$4 \\$5 \\$6 \\$7 \\$8
+..
[...]

Now, I don't want to beat you up about this, but your commit message
said you did one thing (handling `\X` commands) and, as I read it, the
code _did_ another.

I'm a bit puzzled by the phrase "Handle new \X commands".  As a macro
file, pdf.tmac can't "handle" `\X` escape sequences in any way[3]--not
as input.  Those are interpreted directly by the formatter.  That's a
minor point.

But neither do they use `\X` themselves!

Recall the foregoing:

> Device controls are passed by .device (and friends).

But they're not!  You don't use them that way!

Why not?

_Because they didn't work!_

`\X` and `.device` get only slight use in pdf.tmac:

.char \[lh] \X'pdf: xrev'\[rh]\X'pdf: xrev'

(Werner put that in.)

.   device pdf: markstart \\n[rst] \\n[rsb] \\n[PDFHREF.LEADING] \\*[pdf:href.link]

That's you, followed intriguingly but mysteriously by

'   fl

...a non-breaking flush, a thing for whose purpose one would search
groff's documentation for 30 years in vain.

and then

.     device pdf: markend
'     fl

by me (just a tweak to something of yours, probably), and finally

.device pdf: background \\$*

.device pdf: pagenumbering \\$*

...which are both more recent additions, from 2021 and 2023
respectively.

So why the repeated triple axel hacks with "\!x X pdf:"?

I think it's because of that chunk of code I "git blamed" earlier.  In
fact I'll include a bit more because the second version of this
overloaded function goes all the way back to 1991 and is a James Clark
original.

void troff_output_file::start_special(tfont *tf, color *gcol,
                                      color *fcol,
                                      bool omit_command_prefix)
{
  set_font(tf);
  stroke_color(gcol);
  fill_color(fcol);
  flush_tbuf();
  do_motion();
  if (!omit_command_prefix)
    put("x X ");
}

void troff_output_file::start_special()
{
  flush_tbuf();
  do_motion();
  put("x X ");
}

Remember, now, "special" here refers to a device extension command, and
only a laughable naïf would assume it had anything to do with special
characters... :-|

Really, if we banned the words "special" and "transparent" from the
lexicon of all troff developers, we'd make the world a better place.
Whenever you can't think of what to call something, just pick one of
those two words.  Everything will be fine.  >:-(

I swear, all software engineers should be fitted with shock collars.

To get back on track, consider what's going on with the above code.
We've got two ways we can "start [a] special [device extension
command]".  One updates five pieces of state, the other two.  Now, maybe
that's not crazy, but consider what it means.  Your call site determines
which one you get.

There are only a few call sites, and all are within the same file,
"node.cpp".  (That means these functions could and should be marked
`static`,[4] and I will do that after I finish this gargantuan email.)

`special_node::tprint_start()` calls the complex form.

The code handling the `\O[5]` escape sequence calls the simpler one,
along a few different paths based on conditionals.

...and that's it.

Ponder the consequences.  There's no way _within a device extension
command_ (whether by `\X` _or_ `.device`) to tell the formatter which of
the five elements of state need to be updated.  The groff language
doesn't expose this.

In generality, a device extension command _could_ do _anything_, as I
emphasized above, and very old GNU troff code shares that assumption.
Mess with colors?  Maybe!  Change the font?  Could be!  Need to wrap up
a grout extension 't' (or 'u') command for writing a sequence of
ordinary glyphs?  Definitely![5]  Move the drawing position?  It's a
possibility!

Considering these matters led me to realize at long last why GNU troff
output seems to have so many seemingly superfluous cursor motions, and
at least in part, why so many of them are in absolute coordinates (cf.
relative ones) when there seems to be no motivating reason.

Meanwhile, the ultra-specialized `\O5` escape sequence, which our
documentation refuses to explain without accessory garlic and crucifixes
to discourage anyone who isn't developing grohtml to stay far away from
it, knows what it's going to get dirty: namely, not the font and not the
colors, but definitely going to need any pending 't'/'u' command to wrap
up, and certainly going to be moving the drawing position. (`\O5` is the
means by which rasterized images of tbl tables and eqn equations are
inserted into HTML documents produced by groff.)

So `device` and `\X` can give you more than you ask for, more than you
want, and worse, that excess can lead to bad rendering.  And there's NO
WAY in the groff language at present to tell the formatter what kind of
business your device extension command is going to get up to.  When
grohtml had more modest needs, it hunted around and grafted on `\O5`.

That doesn't scale.

But!  What if your device extension command makes _nothing_ dirty, and
requires nothing about rendering state to be aware that it's even there?
Well, then, by golly you can do something mightily clever.

And that is to synthesize your _own_ device control command in the grout
language, 'x', by smuggling it across the border of the formatting
language, thanks to our old friends `\!` ("transparent throughput"--with
astonishing chutzpah the most opaquely named escape sequence in the
language) and the slightly more recent groff-ism, and brother in request
form, `.output`.

And it has worked great for years.

The pièce de résistance, of course, is, having figured out this trick,
to document it nowhere, tell no one, and undertake no effort to attack
the problem at the formatter language level so that everyone can
benefit.  Some resourceful people might copy it, but it's best left as
an "expert mode" trick kept among the cognoscenti.  Now I don't know who
_exactly_ to blame for this state of affairs; it's better for my blood
pressure not to speculate or research the issue, and even saying as much
as I have is likely to make people mad at me.

But it was not a good call.  It created painful technical debt and we
should fix it.  When we solve a problem with a technique that is weird
or fishy, we should cry out in protest, because we're likely not the
only ones who have struggled with that type of problem.  (It's one of
those odd cases where the fewer people who have to deal with it, the
_worse_ the problem is, until n becomes 1, at which point it bothers no
one else by definition.  But when n reaches two, it's instantly a major
nightmare because relevant knowledge is so scarce and un-socialized.)

Okay.  Popping that 55-gallon burning oil drum off the stack...

> > ...which _should_ be able to handle NFD if they handle Unicode at
> > all, right?
> 
> Of course, much better for any sort, which grops/gropdf  do not do,
> and if they did, would of course change the given text to NFD prior to
> sorting.  As regards searching, it's a bit of a two edged sword. For
> example, if the word "ocksŮ" in a utf8 document is used as a text
> heading and a bookmark entry (think .SH "ocksŮ") preconv converts the
> "Ů" to \[u016E], troff then applies NFD to match a glyph name in the
> U-TR font - \[u0055_030A]. When .device and .output used "copy in"
> mode the original unicode code point \[u016E] was passed to the
> device, but with the recent changes, "new mode 3", to \X are rolled
> out to the other 7(?) commands which communicate text to the device
> drivers, they receive instead \[u0055_030A]. If this composite code (4
> bytes in UTF16) is used as the bookmark text, we have seen it can
> produce less optimum results in the bookmark pane

As noted above, I am persuaded that I should abandon decomposition of
Unicode special character escape sequences in device extension commands.

> but it also can screw up searching in the pdf viewer. Okular (a pdf
> viewer) has two search boxes, one for the text, entering "ocksŮ" here
> will find the heading, the second search box is for the bookmarks and
> entering "ocksŮ" will fail to find the bookmark since the final
> character is in fact two characters. This result may surprise users,
> that entering exactly the same keystrokes as they used when writing
> the document, finds the text in the document, but fails to find the
> bookmark.

As noted above, _some_ of this seems to me like a deficiency in PDF,
either the standard or the tools.  But, if the aforementioned
abandonment makes the problem less vexing, cool.
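
The search mismatch Deri describes is reproducible without any PDF
viewer at all.  A small Python demonstration (illustrative only):

```python
import unicodedata

query = "ocks\u016E"  # "ocksŮ" as the user types it (precomposed, NFC)
bookmark = unicodedata.normalize("NFD", query)  # Ů becomes U+0055 U+030A

print(query == bookmark)  # False: same text, different code point sequences
print(query in bookmark)  # False: a naive substring search misses it
# A viewer that normalized both sides before comparing would match:
print(unicodedata.normalize("NFC", bookmark) == query)  # True
```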

> Then why does it work in the text search, you may ask, since they have 
> both been passed an NCD composite code.

What's NCD?  Do you mean NFD?  The former usage persists through the
remainder of your email.

> The answer is because  in the grout passed to the driver it becomes
> "Cu0055_030A" and although this looks like unicode it is just the name
> of a glyph in the font, just as "Caq" in grout will find the
> "quotesingle" glyph. The font header in  the pdf identifies the
> postscript name of each glyph used for the document text and the pdf
> viewer has a lookup table which converts postscript name "Uring" to
> U+016E "Ů" (back where we started).
[...]
> As I've shown the NCD used in grout (Cuxxxx_xxxx) is simply a key to a
> font glyph, this information that this glyph is a composite is
> entirely unnecessary for device control text, I need to know the
> unicode code point  delivered by preconv, so I can deliver that single
> character back as UTF16 text.

Okay.  I'd still like to do _some_ validation of Unicode special
character escape sequences in device extension commands.  I would feel
like a crappy engineer if I permitted GNU troff to hand gropdf the
sequence "x X ps:exec [\[u012Fz] pdfmark".

But gropdf should do validation too.
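
The round trip Deri wants can be illustrated briefly.  gropdf itself is
written in Perl; this Python sketch, with a hypothetical helper name,
only shows the mapping: take the code points from the glyph name,
recompose with NFC so the font designer's single glyph is used, and emit
the BOM-prefixed UTF-16BE form that PDF text strings use.

```python
import unicodedata

def bookmark_text(glyph_name):
    # "u0055_030A" -> U+0055 U+030A -> NFC -> U+016E -> UTF-16BE with a
    # byte order mark, the encoding PDF uses for outline (bookmark) strings.
    text = "".join(chr(int(cp, 16)) for cp in glyph_name[1:].split("_"))
    text = unicodedata.normalize("NFC", text)
    return ("\ufeff" + text).encode("utf-16-be")

print(bookmark_text("u0055_030A").hex())  # feff016e: one 2-byte Ů, not two
```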

> I used round-tripping in the general sense that, after processing you
> end up back where you started (the same as you used it). Why does
> groff have to be involved for something to be considered a
> round-trip? 

I guess we were thinking about the problem in different ways.  I am
pretty deeply concerned about input to and output from the GNU troff
program specifically in this discussion.

> Ok, if it can't be done, just leave what you have changed in \X, but
> leave .device and .output (plus friends) to the current copy-in mode
> which seem to be working fine as they are now,

Here are the coupled pairs as I conceive them.

\X and .device
\! and .output

And then we have .cf and .trf, which are vanishingly little used.  I
need to understand them better, but if `cf` is as laissez-faire as I'm
starting to think it is, we should gate it behind unsafe mode.

I have an ultra-strong preference for making the coupled pairs behave
the same way.  There is substantial precedent for this in GNU troff.

\f and .ft
\s and .ps
\m and .gcolor
\M and .fcolor
\p and .brp

I omit `\v` and `.sp`, since the former cannot spring a trap and the
latter can, and that fact is by deliberate design with well-established
use cases.

I don't see any reason why the coupled pairs above should have different
interpretation rules (beyond those inherent to the syntactical
differences of escape sequences and requests).  Most importantly I want
document and macro package authors to be able to switch between them at
their convenience.  Telling them they need to remember, or look up,
which one reads in copy mode or which one flushes which aspects of grout
state strikes me as an emphatic anti-feature.

> unless you have an example which demonstrates a problem which your
> code solves. The only example you gave of what you are "fixing", the
> .AUTHOR line in a mom example doc, actually works fine, so is probably
> not a good example to justify your changes.

The problem I'm trying to solve is that nearly no one seems to
understand how the formatter works in the area under discussion, and the
few who have known or figured it out to date, ain't tellin'.  If they
had cared to, they could have stopped me in my tracks months to years
ago, when I started complaining about this stuff.

That's an undesirable property of a software system.

What we ended up with was in effect if not in intent, "<snort> Go ahead,
documentation guy--just you figure it out."  That's fine.  Challenge
accepted.

In rejoinder to your implicit scolding for undertaking pointless efforts,
let me offer the following quotes from familiar personages.

"...I was able to make an initial release of Mom after about three
years.  From the beginning, I followed a self-imposed rule:  Write the
documentation as it would appear in the manual before defining a macro.
These weren't descriptions of what I intended to do, but careful
instructions for using as-yet unwritten macros.  Documenting an
already-written macro can lead to getting all twisted up, but
implementing a macro that has to follow the documentation keeps you on
top of things."

https://technicallywewrite.com/2023/09/30/groffmom

"Most details of the constant questioning and experimentation during the
early period of rapid change are long forgotten, as are hundreds of
transitory states that were recorded in the on-line manual.  From time
to time, however, a snapshot was taken in the form of a new printed
edition.  Quite contrary to commercial practice, where a release is
supposed to mark a stable, shaken-down state of affairs, the very act of
preparing a new edition often caused a flurry of improvements simply to
forestall embarrassing admissions of imperfection."

https://www.cs.dartmouth.edu/~doug/reader.pdf

Documentation, like automated testing, keeps honest engineers honest.

If any aspect of a system is infeasible to describe, either without
wincing at how many caveats and asides one has to make, or altogether,
that aspect bears reconsideration.

Hence this thread.  Still callow and green, I recall asking this list
years ago what the warnings at issue meant.  No one would, or maybe
could, answer me.  I resolved to find my own answers.  I've learned a
tremendous amount.  But some of what I have discovered is less than
exemplary.

So, yeah, that's the problem I'm trying to solve.

> > Because all of
> > 
> >     \X
> >     .device
> >     \!
> >     .output
> >     .cf
> >     .trf
> 
> Why are two missing?

Which two did you have in mind?  If I'm overlooking something, you'd be
doing me a favor in telling me.[6]

> > can inject stuff into "grout".
> > 
> > That seems like a perilous path to me.
> 
> Not if you restrict the changes to \X only, and document the
> difference in behaviour from the other 7 methods.

That's the status quo, but for the reasons I think I have thoroughly
aired above, I think it's a bad one.  Authors of interfaces to
device features that _you'd think_ would suggest the use of the
"device-related" escape sequence and request have avoided them to date
because of the undesirable side effects.

"Yeah, we have >this< for that, but nobody uses it.  Instead we just go
straight to page description assembly language."

Is no one ashamed of this?

> It is not a problem; I can certainly embed a composite glyph as part of a
> bookmark, the problem is that it does not always look very good (see pdf) 
> and messes up searching for bookmarks.

For the sake of a thorough reply, I acknowledge again that the
constraint of running all the Unicode special character escape sequences
through the normalization facilities offered by libgroff is unnecessary
here.  I turned to that resource because it was there and I didn't want
to reinvent any wheels.  As we say again and again, DRY.  ;-)

> Have a go if you want, I've got it down to 10 extra lines, but the
> results may be depressing (see PDF).

The good news is that you've shifted me.  I hope I can make `\X` and
`device` language features that you can happily employ to greater effect
in "pdf.tmac".

Thank you for your patience.

Regards,
Branden

[1]

commit e7c9dbd201a241e8c42f34ef09acbc16584f16c3
Author: Werner LEMBERG <wl@gnu.org>
Date:   Fri Dec 30 09:31:50 2005 +0000

    New preprocessor `preconv' to convert input encodings to something
    groff can understand.  Not yet integrated within groff.  Proper
    autoconf stuff is missing too.

    Tomohiro Kubota has written a first draft of this program, and some
    ideas have been reused (while almost no code has been taken
    actually).

    * src/preproc/preconv/preconv.cpp. src/preproc/preconv/Makefile.sub:
    New files.

    * MANIFEST, Makefile.in (CCPROGDIRS), test-groff.in
    (GROFF_BIN_PATH): Add preconv.

commit e9a1d5af572610f8ad80a0c18a0f6b02306fed03
Author: Werner LEMBERG <wl@gnu.org>
Date:   Sun Jan 1 16:31:01 2006 +0000

    * src/preproc/preconv/preconv.cpp (emacs_to_mime): Various
    corrections:
      . Don't map ascii to latin-1.
      . Don't use IBMxxx encodings but cpxxx for portability.
      . Map cp932, cp936, cp949, cp950 to itself.
    (emacs2mime): Protect calls to strcasecmp.
    (conversion_iconv): Add missing call to iconv_close.
    (do_file): Emit error message in case of unsupported encoding.

[and so on]

[2] plus programmable positioning tricks that advanced font file formats
    employ, as I understand it, so you can render pretty Vietnamese,
    among other things

[3] Short of, maybe, writing them into a diversion, which has been done,
    and selectively filtering them based on node identity, for which
    insufficient facilities in the groff language are available to date.
    Historically, we throw the `unformat` and `asciify` requests at such
    diversions and pray that they do what we need.

    You can also "handle" it by it not _being_ an escape sequence in the
    first place.  For instance, by changing or disabling the escape
    character.  But string handling facilities are few in the groff
    language.  As I keep saying, I hope to fix that.

[4] In C, I'm certain of that.  In C++, the fact that they're member
    functions of a class may have some bearing.  Static member functions
    are conceivable, as these need no specialization by object identity.
    Moreover, there is only ever one `troff_output_file` object in
    existence during the lifetime of any GNU troff process anyway.

    My attempt at a minor cleanup might explode in my face anyway.  C++
    is a language meticulously accreted from chewed bubble gum and
    whatever could be methodically swept from the floors of jail houses
    and crack dens, augmented with the glittering chrome of
    revolutionary innovations the occasional hacker from Microsoft or
    Sun wangled in on the force of his boundless ambition to get
    promoted up the Principal Engineer/Distinguished Engineer/Fellow
    ladder.

[5] It seems that `flush_tbuf()` is the only thing that really needs to
    be unconditional.  It refers to the buffer of ordinary characters
    being assembled into a 't' or 'u' grout command.  This is an aspect
    of formatter state, not document state, and the casual commingling
    of these matters is yet another frustration.

    Concretely, if we've got a 't' command in progress when we hit a
    device extension command, we _have_ to finish that 't' command.

    t fooba
    x X pdf: 12 double chocolate chip /bakecookies
    c r

    t foobax X pdf: 12 double chocolate chip /bakecookies

    ...would be riotously wrong.

[6] If `\?` is one of them--inapplicable.  It's explicitly prevented
    from bubbling its argument out of the top-level diversion to grout.
