
Re: "transparent" output and throughput, demystified


From: Deri
Subject: Re: "transparent" output and throughput, demystified
Date: Thu, 05 Sep 2024 20:31:55 +0100

On Thursday, 5 September 2024 04:15:44 BST G. Branden Robinson wrote:
> [fair warning: _gigantic_ message, 5.7k words]


> Hi Deri & Dave,

Hi Branden,

> I'll quote Dave first since his message was brief and permits me to make
> a concession early.
> 
> At 2024-09-04T15:05:38-0500, Dave Kemper wrote:
> > On Wed, Sep 4, 2024 at 11:04 AM Deri <deri@chuzzlewit.myzen.co.uk>
> > 
> > wrote:
> > > The example using \[u012F] is superior (in my opinion) because it is
> > > using a single glyph the font designer intended for that character
> > > rather than combining two glyphs that don't marry up too well.
> > 
> > I agree with this opinion.
> 
> Me too.  I can't deny that the pre-composed Ů looks much better than the
> constructed one.
> 
> > > If you know of any fonts which include combining diacritics but
> > > don't provide single glyphs with the base character and the
> > > diacritic combined, please correct me.
> > 
> > My go-to example here is the satirical umlaut over the n in the
> > canonical rendering of the band name Spinal Tap.  Combining diacritics
> > can form glyphs that no natural language uses, so no font will supply
> > a precomposed form.

That is the purpose of combining diacritics: to "invent" a glyph for a 
character which does not exist in a font. But if a glyph does exist for the 
composed character, it seems bizarre for troff to require the composited glyph 
be used rather than the purposely designed glyph. The difference with pdf 
meta-data (which uses UTF16) is that it is not restricted to whatever fonts 
are in the pdf; for rendering it uses system fonts, so even though a 
particular character may be missing from the font in the pdf, it is extremely 
likely to be available within the system fonts.
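
Deri's point about precomposed glyphs can be checked with Python's unicodedata module; this is purely an illustration of the unicode normalization involved, not of groff's code:

```python
import unicodedata

# U+012F LATIN SMALL LETTER I WITH OGONEK, as preconv would deliver it.
precomposed = "\u012f"

# NFD splits it into the base letter plus U+0328 COMBINING OGONEK,
# which corresponds to the \[u0069_0328] form troff uses for glyph lookup.
decomposed = unicodedata.normalize("NFD", precomposed)
assert decomposed == "\u0069\u0328"

# NFC recombines it, recovering the single glyph the font designer drew.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```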
 
> It does happen, and as a typesetting application I think we _can_ expect
> people to try such things.  

Of course, and if a particular character combination has no unicode code 
point, then compositing in both the document and the meta-data is the only 
option. But if a code point for the character combination does exist, then it 
should be passed to the device driver, because that gives the system fonts a 
chance to use the custom glyph.
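
The Spin̈al Tap case below is the opposite situation: no precomposed code point exists, so normalization cannot help and compositing is unavoidable. A quick check (again with Python's unicodedata, for illustration only):

```python
import unicodedata

# "n" followed by U+0308 COMBINING DIAERESIS: the satirical Spin̈al Tap glyph.
spinal_n = "n\u0308"

# Unicode defines no precomposed "n with diaeresis", so NFC leaves the
# pair untouched; the only way to render it is to composite two glyphs.
assert unicodedata.normalize("NFC", spinal_n) == spinal_n
assert len(unicodedata.normalize("NFC", spinal_n)) == 2
```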

> Ugly rendering is better than no rendering
> at all, and it's not our job to make Okular render complex characters
> prettily in its navigation pane.

But this is a retrograde step: passing glyphs to device drivers in NFD form 
did not occur when text was passed to device drivers in copy-in mode, so we 
are now forcing Okular (and all the other pdf viewers I tried) to render the 
given text in a sub-standard way, which we did not before.
 
> That said, we _can_ throw it a bone, and it seems easy enough to do so.
> 
> > > This result may surprise users, that entering exactly the same
> > > keystrokes as they used when writing the document, finds the text in
> > > the document, but fails to find the bookmark.
> > 
> > I also agree this is less than ideal.
> 
> This, I'm not sure is our fault either.  Shouldn't the CMap be getting
> applied to navigation pane text just as to document text?  I confess
> that it did not occur to me that one would _want_ to do a full-text
> search on the navigation pane contents themselves.  When I search a PDF,
> I want to search _the document_.

CMap applies to the fonts used in the document text, not meta-data rendered in 
the system fonts.

> But that may simply be a failure of my imagination.
> 
> At 2024-09-04T17:03:09+0100, Deri wrote:
> > On Sunday, 1 September 2024 06:09:17 BST G. Branden Robinson wrote:
> > > But, is doing the decomposition wrong?  I think it's intended.
> > > 
> > > Here's what our documentation says.
> > > 
> > > groff_char(7):
> > >      Unicode code points can be composed as well; when they are, GNU
> > >      troff requires NFD (Normalization Form D), where all Unicode
> > >      glyphs are maximally decomposed.  (Exception: precomposed
> > >      characters in the Latin‐1 supplement described above are also
> > >      accepted.  Do not count on this exception remaining in a future
> > >      GNU troff that accepts UTF‐8 input directly.)  Thus, GNU troff
> > >      accepts “caf\['e]”, “caf\[e aa]”, and “caf\[u0065_0301]”, as
> > >      ways to input “café”.  (Due to its legacy 8‐bit encoding
> > >      compatibility, at present it also accepts “caf\[u00E9]” on ISO
> > >      Latin‐1 systems.)
> > 
> > Exactly, it says it "can" be composed, not that it must be (this text
> > was added by you post 1.22.4),
> 
> ...to groff_char(7), yes.  Those lunatics who aren't allergic to GNU
> Texinfo would find it familiar from long ago.
> 
> commit 3df65a650247b1dd872b7afd4706ebbbfdd93982
> Author:     Werner LEMBERG <wl@gnu.org>
> AuthorDate: Sun Mar 2 10:10:17 2003 +0000
> Commit:     Werner LEMBERG <wl@gnu.org>
> CommitDate: Sun Mar 2 10:10:17 2003 +0000
> 
>     Document composite glyphs and the `composite' request.
> 
>     * man/groff.man, man/groff_diff.man, doc/groff.texinfo: Do it.
> [...]
> +For simplicity, all Unicode characters which are composites must be
> +decomposed maximally (this is normalization form@tie{}D in the Unicode
> +standard); for example, @code{u00CA_0301} is not a valid glyph name
> +since U+00CA (@sc{latin capital letter e with circumflex}) can be
> +further decomposed into U+0045 (@sc{latin capital letter e}) and U+0302
> +(@sc{combining circumflex accent}).  @code{u0045_0302_0301} is thus the
> +glyph name for U+1EBE, @sc{latin capital letter e with circumflex and
> +acute}.
> [...]

I thought this simply meant that if you input a composite character to groff 
it must be maximally decomposed. The paragraph before deals with input of 
non-composite unicode. So this is saying that both groff representations of 
unicode code points (uXXXX[X[X]], and 'u' component1 '_' component2 '_' 
component3 ...) are acceptable as input.

I don't understand why you consider this wording means you have to output NCD 
to output drivers, i.e. \[uXXXX_XXXX] rather than the \[uXXXX] originally 
provided by preconv.

> That's a good 14 years before I wandered in and ruined all our docs.
> 
> > in fact most \[uxxxx] input to groff is not composed (comes from
> > preconv).
> 
> Yes.  Werner wrote preconv, too, so I reckon he made it produce what he
> designed GNU troff to consume.[1]

Yes, agreed, I admire Werner's work.

> > Troff then performs NFD conversion (so that it matches a named glyph
> > in the font).
> 
> Fair.  I can concede that the primary purpose of NFD decomposition in
> groff is to facilitate straightforward and unambiguous glyph lookup.

Great.

> > Conceptually this is a stream of named glyphs, there is a second
> > stream of device control text,
> 
> My conceptualization doesn't seem to quite match yours.  I think of a
> grout document as consisting of _one_ stream, the sequence of bytes from
> its start to its finish.  There are however multiple interpretation
> contexts.  Special character names are one.  't' (and 'u') command
> arguments are another (no special characters allowed)...
> 

I see. Of course any file can be considered as _one_ stream, a sequence of 
bytes, but I don't think that helps us much!

Perhaps I should elucidate. The grout contents can either affect the contents 
of the page or the "container" (this is meta-data). For grodvi, 
\X'papersize=...' affects the window size when displayed in a dvi viewer; it 
has no effect on the contents. The PDFMARK extensions for grops affect 
meta-data of the pdf when distilled. There is no way of telling from grout 
whether a particular "x X" command affects the document rendition or the 
container, which is a shame.

The 't' and 'u' commands are identical to 'C' followed by 'h': they are 
strings of single-character glyph names to be found in the current font, not 
text. So "tHello" is the equivalent of:-

CH
h (width of H)
Ce
h (width of e)
etc.

'u' is similar, except the 'h' commands take into account the value given to 
'u'.
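
That expansion can be sketched in a few lines of Python. This is a minimal illustration of the equivalence, not gropdf's actual code, and the width table is invented for the example (real widths come from the font description, scaled to the current size):

```python
# Hypothetical sketch: rewrite a grout "t<text>" command as the
# equivalent alternating 'C' (print named glyph) and 'h' (move right)
# commands. The widths dict stands in for font metric lookups.

def expand_t(text, widths):
    """Expand the argument of a 't' command into C/h command pairs."""
    commands = []
    for ch in text:
        commands.append("C" + ch)               # glyph name = the character
        commands.append("h" + str(widths[ch]))  # advance by the glyph width
    return commands

# For example, with invented widths in device units:
print(expand_t("He", {"H": 7200, "e": 4440}))
# → ['CH', 'h7200', 'Ce', 'h4440']
```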
  
> $ echo "hello, world" | groff -T ascii -Z \
>   | sed 's/world/\\[uHAHAHAHA CHECK THIS OUT]/' | grep '^t'
> 
> thello,
> t\[uHAHAHAHA CHECK THIS OUT]
> 
> > and this has nothing to do with fonts or glyph names.
> 
> Here I simply must push back.  This depends entirely on what the device
> extension does, and we have _no control_ over that.

Are you using "we" here to mean "groff developers"? Can't be true, since we 
can change the device drivers, and I can particularly change gropdf. Were you 
using "we" in a more regal mode?

> Excessive
> presumptions of such open ended language structures get us into trouble,
> as with the question of whether setting the line drawing thickness in a
> '\D' escape sequence should move the drawing position.  Coming at the
> question ab initio, there's no reason to suppose that it should.  I will
> boldly assert that one negative precedent was a blunder of Kernighan's,
> assuming that all future drawing commands would exclusively comprise
> sequences of coordinate pairs reflecting page motions.
> 
> Some geometric objects aren't usefully parameterized that way, as
> Kernighan should have realized from his own '\D'c radius' command.
> Apart from line thickness, possibilities like configuration of broken
> line rendering (a nearly limitless variety of dotted, dashed,
> dash-dotted, solid, and, maybe, invisible) should have been obvious even
> at the time.
> 
> Opinions?  I got 'em.  Anyway, I think it's a bad idea to assume that a
> device extension will never have anything to do with fonts or glyph
> names.  In fact, GNU troff already assumes that they might, and has done
> for over 20 years.
> 
> 1a153a5268 src/roff/troff/node.cc (Werner LEMBERG 2002-10-02 17:06:46 +0000  880) void troff_output_file::start_special(tfont *tf, color *gcol, color *fcol,
> 1a153a5268 src/roff/troff/node.cc (Werner LEMBERG 2002-10-02 17:06:46 +0000  881)                                       int no_init_string)
> 037ff7dfcf src/roff/troff/node.cc (Werner LEMBERG 2001-01-17 14:17:26 +0000  882) {
> 6f6302b0af src/roff/troff/node.cc (Werner LEMBERG 2002-10-26 12:26:12 +0000  883)   set_font(tf);
> 1a153a5268 src/roff/troff/node.cc (Werner LEMBERG 2002-10-02 17:06:46 +0000  884)   glyph_color(gcol);
> 1a153a5268 src/roff/troff/node.cc (Werner LEMBERG 2002-10-02 17:06:46 +0000  885)   fill_color(fcol);
> 6f6302b0af src/roff/troff/node.cc (Werner LEMBERG 2002-10-26 12:26:12 +0000  886)   flush_tbuf();
> 6f6302b0af src/roff/troff/node.cc (Werner LEMBERG 2002-10-26 12:26:12 +0000  887)   do_motion();
> 7ae95d63be src/roff/troff/node.cc (Werner LEMBERG 2001-04-06 13:03:18 +0000  888)   if (!no_init_string)
> 7ae95d63be src/roff/troff/node.cc (Werner LEMBERG 2001-04-06 13:03:18 +0000  889)     put("x X ");
> 037ff7dfcf src/roff/troff/node.cc (Werner LEMBERG 2001-01-17 14:17:26 +0000  890) }
> 
> A fortiori, the formatter seems to assume that a "special" (device
> extension command) will dirty everything about the drawing context that
> can possibly be made dirty.  This is causing me considerable grief, as
> you've seen in <https://savannah.gnu.org/bugs/?64484>.

Isn't it just good practice to reset to a known state after making an external 
call? In assembler you push the registers onto the stack before making a call 
to a third-party subroutine, and restore them afterwards.
> 
> > Device controls are passed by .device (and friends).
> 
> You'd think, wouldn't you?  And I'd love to endorse that viewpoint.
> 
> But I can't, and your own preferences are erecting a barrier to my doing
> so.  I'll come back to that with an illustration below.
> 
> > \[u0069_0328] is a named glyph in a font, \[u012F] is a 7bit ascii
> > representation (provided by preconv) of the unicode code point.
> 
> Okay, couple of things: 0x12F does not fit in 7 bits, nor even in 8.

But the text "\[u012F]" does.

> It's precomposed.  It's enough to say that.
> 
> Second, \[u0069_0328] is not _solely_ a named glyph in a font.  We use
> it that way, yes, and for a good reason as far as I can tell (noted
> above), but it is a _general syntax for combining Unicode characters in
> the groff language_.  Not only can you use it to express "base
> characters" combined with one or more characters to which Unicode
> assigns "combining" (non-spacing[2]) semantics, but you can also use it
> to express ligatures.  And GNU troff does.  And has, since long before I
> got here.

What happens when you use \[fi] in text? The TR font description says:-

fi      556,683 2       140     fi      --      FB01

So the postscript name is also "fi", and the postscript-to-unicode mapping 
says this is codepoint 0xFB01; nowhere does it ever become u0066_0069 within 
the entire code path.

Even if you ask troff to composite it as \[u0066_0069], troff sensibly 
changes this to Cfi in grout, so we are back where we started.
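
Incidentally, unicode normalization itself treats U+FB01 the same way: the ligature has only a *compatibility* decomposition, so canonical NFD (the form under discussion) never turns it into "fi". A quick Python check, for illustration only:

```python
import unicodedata

fi = "\ufb01"  # U+FB01 LATIN SMALL LIGATURE FI

# Canonical decomposition (NFD) leaves the ligature intact...
assert unicodedata.normalize("NFD", fi) == fi

# ...only compatibility decomposition (NFKD) splits it into "f" + "i".
assert unicodedata.normalize("NFKD", fi) == "fi"
```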

> groff_char(7) again:
> 
>    Ligatures and digraphs
>        Output   Input   Unicode           Notes
>        ──────────────────────────────────────────────────────────────────
>        ff       \[ff]   u0066_0066        ff ligature +
>        fi       \[fi]   u0066_0069        fi ligature +
>        fl       \[fl]   u0066_006C        fl ligature +
>        ffi      \[Fi]   u0066_0066_0069   ffi ligature +
>        ffl      \[Fl]   u0066_0066_006C   ffl ligature +
>        Æ        \[AE]   u00C6             AE ligature
>        æ        \[ae]   u00E6             ae ligature
>        Œ        \[OE]   u0152             OE ligature
>        œ        \[oe]   u0153             oe ligature
>        IJ        \[IJ]   u0132             IJ digraph
>        ij        \[ij]   u0133             ij digraph
> 
> How one decomposes such a composite character depends.  Ligatures should
> be, and are, broken up and written one-by-one as their constituents, all
> base characters.  Accented characters may have to be degraded to the
> base character alone; one would certainly not serialize them like a
> ligature.  How nai¨ve!  ;-)
> 
> > The groff_char(7) you quote is simply saying that input to groff can
> > be composited or not.
> 
> I don't know about "simply", but yes.

Well, I understood it, so it must be simple.

> > How has that any bearing on how troff talks to its drivers.
> 
> Of itself, it doesn't.  But because the output language, which I call
> "grout", affords extension in certain ways, including a general purpose
> escape hatch for device capabilities lacking abstraction in the
> formatter, it _can_ come up.
> 
> As in two of the precise situations that lifted the lid on this infernal
> cauldron: the annotation and rendering _outside of a document's text_
> of section headings, and document metadata naming authors, who might
> foolishly choose to be born to parents that don't feel bound by the
> ASCII character set, and as such can appear spattered with diacritics in
> an "info" dialog.

mon_premier_doc.mom (in mom/examples) was authored by one such unfortunate:-

.AUTHOR "Cicéron"

The "info" dialog looks fine to me, what do you see wrong?

> 
> If GNU troff is to have any influence over how such things appear, we're
> must consider the problem of how to express text there, and preferably
> do so in ways that aren't painful for document authors to use.
> 
> > If a user actually wants to use a composite character this is saying
> > you can enter \[u0069_0328] or you can leave it to preconv to use \
> > [u012F]. Unfortunately the way you intend to change groff, document
> > text will always use the single glyph (if available)
> 
> Eh what?  Where is this implied by anything I've committed or proposed?
> (It may not end up mattering given the point I'm conceding.)

Of course if you split a sentence, you can make it look stupid and exclaim "Eh 
what?" even though the change you have made to groff affects the second half 
of the sentence - the delivery of text to output drivers. Nice trick!
> 
> > and meta-data will always use a composite glyph.
> 
> Strictly, it will always use whatever I get back from certain "libgroff"
> functions like.  But I'm willing to flex on that.  Your "Se ocksŮ"
> example is persuasive.

Yes, but it took so much effort to finally persuade you; I wish the penny had 
dropped a bit quicker.
 
> Though some irritated Swede is bound to knock us about like tenpins if
> we keep deliberately misspelling "också" like that.

I profusely apologise, it was entirely for demonstration.

> > So there is no real choice for the user.
> 
> Okay, how about a more pass-through approach when it comes to byte
> sequences of the form `\[uxxxx]` (where 'xxxx' is 4 to 6 uppercase
> hexadecimal digits)?

Yes please, this is what I receive now via preconv and copy-in mode. You may 
pass a composite \[uXXXX_XXXX] when the actual user input is in that form 
(which preconv would never output), but if Dave wants to use Spin̈al Tap, this 
seems to work:-

printf ".ft TINOR\n.ps 18\nSpin\h'-5p'\[u0308]\h'+5p'al Tap\n.pdfbookmark 1 Spi\[u006E_0308]al Tap" | test-groff -Tpdf -ms > Spin̈alTap.pdf

(only using my development version of gropdf - still testing). PDF attached.

> I will have to stop using `valid_unicode_code_sequence()` from libgroff.
> But that can be done.  And I need multiple validators regardless (or
> flags to a common one), as there's no sensible way to handle code points
> above U+00FF in file names, shell commands, or terminal messages,
> because they all consist of C `const char *` strings (that moreover will
> require transformation to C language character escapes--I hope only the
> octal sort, though).  For more on this, see my conversation with Dave in
> <https://savannah.gnu.org/bugs/?65108>.
> 
> > User facing programs use NFD, since it makes it easier to sort and
> > search the glyph stream. Neither grops nor gropdf are "user facing",
> > they are generators of documents which require a viewer or printer to
> > render them, the only user facing driver is possibly X11. There is a
> > visible difference between using NFD and using the actual unicode text
> > character when specifying pdf bookmarks. The attached PDF has
> > screenshots of the bookmark panel, using \[u0069_0328] NFD and
> > \[u012F] NFC. The example using \ [u012F] is superior (in my opinion)
> > because it is using a single glyph the font designer intended for that
> > character rather than combining two glyphs that don't marry up too
> > well.
> 
> Setting aside the term "user-facing programs", which you and I might
> define differently, I find the above argument sound.  (Well, I'm a
> _little_ puzzled by how precomposed characters are so valuable for
> searching bookmarks since the PDF standard already had the CMap facility
> lying right there.)

I've explained that: just like any application, all application text 
(including the bookmarks panel, info dialog, menus etc.) is handled by the 
desktop windowing system (GTK, QT ...); only the canvas upon which pages are 
rendered has access to the CMap.

> > This has no bearing on whether it is sensible to use NFD to send text
> > to output drivers rather than the actual unicode value of the
> > character.
> 
> That's vaguely worded.  I assume you mean "text in device extension
> commands here".  If so, conceded.
> 
> > > It seems like a good thing to hold onto for the misty future when we
> > > get TTF/OTF font support.
> > 
> > So, it does not make sense now, but might in the future.
> 
> This isn't a makeweight argument.  We know such font _formats_ exist,
> regardless of the repertoires that their specimens have conventionally
> supported to date.  I think we'd be wise not to nail this door shut,
> even if we don't walk through it today.
> 
> > I would concede here if composited glyphs were as good as the single
> > glyph provided in the  font, but the PDF attached shows this is not
> > always true. Also, from TTF/ OTF fonts I've examined, if the font
> > contains combining diacritics it also contains glyphs for all the base
> > characters which can use a diacritic, since it is just calls to
> > subroutines with any nnecessary repositioning.  If you know of any
> > fonts which include combining diacritics but don't provide single
> > glyphs with the base character and the diacritic combined, please
> > correct me.
> 
> I know of none, and I am confident your experience in font perusal and
> evaluation is vastly broader than mine.
> 
> > > > Given that the purpose of \X is to pass meta-data to output
> > > > drivers,
> 
> I agree with this earlier statement of yours, but I want to seize on it.
> Here's why.  This is going to take a while.

[Snipped a lot here - it did not seem to have much to do with whether it was 
sensible to use NFD when communicating with device drivers - most of it was 
probably thinking out loud]

> > > ...which _should_ be able to handle NFD if they handle Unicode at
> > > all, right?
> > 
> > Of course, much better for any sort, which grops/gropdf  do not do,
> > and if they did, would of course change the given text to NFD prior to
> > sorting.  As regards searching, it's a bit of a two edged sword. For
> > example, if the word "ocksŮ" in a utf8 document is used as a text
> > heading and a bookmark entry (think .SH "ocksŮ") preconv converts the
> > "Ů" to \[u016E], troff then applies NFD to match a glyph name in the
> > U-TR font - \[u0055_030A]. When .device and .output used "copy in"
> > mode the original unicode code point \ [u016E] was passed to the
> > device, but with the recent changes, "new mode 3", to \X are rolled
> > out to the other 7(?) commands which communicate text to the device
> > drivers, they receive instead \[u0055_030A]. If this composite code (4
> > bytes in UTF16) is used as the bookmark text, we have seen it can
> > produce less optimum results in the bookmark pane
> 
> As noted above, I am persuaded that I should abandon decomposition of
> Unicode special character escape sequences in device extension commands.
> 
> > but it also can screw up searching in the pdf viewer. Okular (a pdf
> > viewer) has two search boxes, one for the text, entering "ocksŮ" here
> > will find the heading, the second search box is for the bookmarks and
> > entering "ocksŮ" will fail to find the bookmark since the final
> > character is in fact two characters. This result may surprise users,
> > that entering exactly the same keystrokes as they used when writing
> > the document, finds the text in the document, but fails to find the
> > bookmark.
> 
> As noted above, _some_ of this seems to me like a deficiency in PDF,
> either the standard or the tools.  But, if the aforementioned
> abandonment makes the problem less vexing, cool.

See explanation above.
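
The search mismatch described there can be reproduced without any pdf viewer; the following is a sketch of the naive comparison (not of Okular's actual search code):

```python
import unicodedata

# The bookmark string as stored when troff emits NFD:
# "ocks" followed by U+0055 and U+030A (combining ring above).
bookmark = unicodedata.normalize("NFD", "ocks\u016e")

# What the user types into the search box: the precomposed NFC form.
query = "ocks\u016e"

# A code-point-for-code-point substring search fails, because U+016E
# is now two characters in the bookmark text...
assert query not in bookmark

# ...whereas normalizing both sides first would make the search succeed.
assert unicodedata.normalize("NFC", bookmark) == query
```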

> > Then why does it work in the text search, you may ask, since they have
> > both been passed an NCD composite code.
> 
> What's NCD?  Do you mean NFD?  The former usage persists through the
> remainder of your email.

Of course, I was starting to flag.

> > The answer is because  in the grout passed to the driver it becomes
> > "Cu0055_030A" and although this looks like unicode it is just the name
> > of a glyph in the font, just as "Caq" in grout will find the
> > "quotesingle" glyph. The font header in  the pdf identifies the
> > postscript name of each glyph used for the document text and the pdf
> > viewer has a lookup table which converts postscript name "Uring" to
> > U+016E "Ů" (back where we started).
> 
> [...]
> 
> > As I've shown the NCD used in grout (Cuxxxx_xxxx) is simply a key to a
> > font glyph, this information that this glyph is a composite is
> > entirely unnecessary for device control text, I need to know the
> > unicode code point  delivered by preconv, so I can deliver that single
> > character back as UTF16 text.
> 
> Okay.  I'd still like to do _some_ validation of Unicode special
> character escape sequences in device extension commands.  I would feel
> like a crappy engineer if I permitted GNU troff to hand gropdf the
> sequence "x X ps:exec [\[u012Fz] pdfmark".
> 
> But gropdf should do validation too.
> 
> > I used round-tripping in the general sense that, after processing you
> > end up back where you started (the same as you used it). Why does
> > groff have to be involved for something to be considfered a
> > round-trip?
> 
> I guess we were thinking about the problem in different ways.  I am
> pretty deeply concerned about input to and output from the GNU troff
> program specifically in this discussion.

Wood/Trees.

> > Ok, if it can't be done, just leave what you have changed in \X, but
> > leave .device and .output (plus friends) to the current copy-in mode
> > which seem to be working fine as they are now,

At some point you did tell me it had to be NFD because you were using the 
routine which generated the glyph names (which are NFD).

> Here are the coupled pairs as I conceive them.
> 
> \X and .device
> \! and .output
> 
> And then we have .cf and .trf, which are vanishingly little used.  I
> need to understand them better, but if `cf` is as laissez-faire as I'm
> starting to think it is, we should gate it behind unsafe mode.
[snip again]
 
> > > Because all of
> > > 
> > >   \X
> > >   .device
> > >   \!
> > >   .output
> > >   .cf
> > >   .trf
> > 
> > Why are two missing?
> 
> Which two did you have in mind?  If I'm overlooking something, you'd be
> doing me a favor in telling me.[6]

Don't say I never do you a favour. :-)

\Y
.devicem

If you intend to include these in the changes you have made to \X, you had 
better talk to Tadziu, who often uses these (i.e. yesterday), since the 
problem described in https://savannah.gnu.org/bugs/?66165#comment0 would stop 
the postscript snippet posted yesterday from working: you are changing "-" on 
input to \[u2010], so text such as "-1" becomes "\[u2010]1", which a 
postscript interpreter will not understand.

> > > can inject stuff into "grout".
> > > 
> > > That seems like a perilous path to me.
> > 
> > Not if you restrict the changes to \X only, and document the
> > difference in behaviour from the other 7 methods.
> 
> That's the status quo, but for the reasons I think I have thoroughly
> aired above, I think it's a bad one.  Authors of interfaces to
> device features that _you'd think_ would suggest the use of the
> "device-related" escape sequence and request have avoided them to date
> because of the undesirable side effects.
> 
> "Yeah, we have >this< for that, but nobody uses it.  Instead we just go
> straight to page description assembly language."
> 
> Is no one ashamed of this?

I might be ashamed if I understood what you were talking about, and knew who 
you were quoting ("Yeah, ...").

> > It is not a problem I can certainly embed a composite glyph as part of a
> > bookmark, the problem is that it does not always look very good (see pdf)
> > and messes up searching for bookmarks.
> 
> For the sake of a thorough reply, I acknowledge again that the
> constraint of running all the Unicode special character escape sequences
> through the normalization facilities offered by libgroff are unnecessary
> here.  I turned to that resource because it was there and I didn't want
> to reinvent any wheels.  As we say again and again, DRY.  ;-)
> 
> > Have a go if you want, I've got it down to 10 extra lines, but the
> > results may be depressing (see PDF).
> 
> The good news is that you've shifted me.  I hope I can make `\X` and
> `device` language features that you can happily employ to greater effect
> in "pdf.tmac".
> 
> Thank you for your patience.

As I read this email I thought: "The pith of this email is that Branden 
agrees that passing groff unicode characters (\[uXXXX]) in NFD format is 
sub-optimal, and he will revisit the code to rectify it." It could have been 
almost as short as Dave's, but it wasn't. :-(

Cheers

Deri

See you in bug #66155.

> Regards,
> Branden
> 
> [1]
> 
> commit e7c9dbd201a241e8c42f34ef09acbc16584f16c3
> Author: Werner LEMBERG <wl@gnu.org>
> Date:   Fri Dec 30 09:31:50 2005 +0000
> 
>     New preprocessor `preconv' to convert input encodings to something
>     groff can understand.  Not yet integrated within groff.  Proper
>     autoconf stuff is missing too.
> 
>     Tomohiro Kubota has written a first draft of this program, and some
>     ideas have been reused (while almost no code has been taken
>     actually).
> 
>     * src/preproc/preconv/preconv.cpp. src/preproc/preconv/Makefile.sub:
>     New files.
> 
>     * MANIFEST, Makefile.in (CCPROGDIRS), test-groff.in
>     (GROFF_BIN_PATH): Add preconv.
> 
> commit e9a1d5af572610f8ad80a0c18a0f6b02306fed03
> Author: Werner LEMBERG <wl@gnu.org>
> Date:   Sun Jan 1 16:31:01 2006 +0000
> 
>     * src/preproc/preconv/preconv.cpp (emacs_to_mime): Various
>     corrections:
>       . Don't map ascii to latin-1.
>       . Don't use IBMxxx encodings but cpxxx for portability.
>       . Map cp932, cp936, cp949, cp950 to itself.
>     (emacs2mime): Protect calls to strcasecmp.
>     (conversion_iconv): Add missing call to iconv_close.
>     (do_file): Emit error message in case of unsupported encoding.
> 
> [and so on]
> 
> [2] plus programmable positioning tricks that advanced font file formats
>     employ, as I understand it, so you can render pretty Vietnamese,
>     among other things
> 
> [3] Short of, maybe, writing them into a diversion, which has been done,
>     and selectively filtering them based on node identity, for which
>     insufficient facilities in the groff language are available to date.
>     Historically, we throw the `unformat` and `asciify` requests at such
>     diversions and pray that they do what we need.
> 
>     You can also "handle" it by it not _being_ an escape sequence in the
>     first place.  For instance, by changing or disabling the escape
>     character.  But string handling facilities are few in the groff
>     language.  As I keep saying, I hope to fix that.
> 
> [4] In C, I'm certain of that.  In C++, the fact that they're member
>     functions of a class may have some bearing.  Static member functions
>     are conceivable, as these need no specialization by object identity.
>     Moreover, there is only ever one `troff_output_file` object in
>     existence during the lifetime of any GNU troff process anyway.
> 
>     My attempt at a minor cleanup might explode in my face anyway.  C++
>     is a language meticulously accreted from chewed bubble gum and
>     whatever could be methodically swept from the floors of jail houses
>     and crack dens, augmented with the glittering chrome of
>     revolutionary innovations the occasional hacker from Microsoft or
>     Sun wangled in on the force of his boundless ambition to get
>     promoted up the Principal Engineer/Distinguished Engineer/Fellow
>     ladder.
> 
> [5] It seems that `flush_tbuf()` is the only thing that really needs to
>     be unconditional.  It refers to the buffer of ordinary characters
>     being assembled into a 't' or 'u' grout command.  This is an aspect
>     of formatter state, not document state, and the casual commingling
>     of these matters is yet another frustration.
> 
>     Concretely, if we've got a 't' command in progress when we hit a
>     device extension command, we _have_ to finish that 't' command.
> 
>     t fooba
>     x X pdf: 12 double chocolate chip /bakecookies
>     c r
> 
>     t foobax X pdf: 12 double chocolate chip /bakecookies
> 
>     ...would be riotously wrong.
> 
> [6] If `\?` is one of them--inapplicable.  It's explicitly prevented
>     from bubbling its argument out of the top-level diversion to grout.

Attachment: pdf3hRFL3SRGI.pdf
Description: Adobe PDF document

