Re: Ghostscript/GhostPDL 9.22 Release Candidate 1

From: Ken Sharp
Subject: Re: Ghostscript/GhostPDL 9.22 Release Candidate 1
Date: Tue, 19 Sep 2017 13:30:04 +0100

At 13:42 19/09/2017 +0200, David Kastrup wrote:

> So the mechanisms mostly out of our own control are Ghostscript in its
> ps2pdf facility, various TeX engines when including lots of
> ps2pdf-generated PDF files into a main document.

To me this is where the problem lies. PDF is good as a terminal document format, and that was its original aim. It's not good as an intermediate format, or for inclusion in more complex documents.

I feel the correct answer to this is not to use PDF as an intermediate format. It seems to me you should stick with a typesetting format, because that allows you to determine that fonts which are named the same are in fact the same, and you don't need to include them multiple times. In fact for a layout format, you wouldn't normally include the actual fonts at all, of course.

> For this use case, we
> want a process that avoids excessive font duplication.  The process so
> far involved an additional Ghostscript run removing most of the
> duplicates from the TeX-generated PDF (someone please correct me if I
> got this wrong).

This only works because all the PDF files you are using (so far) embed the whole font, don't use subsets, and use the same Encoding (or use different names so that they are clearly different fonts). Were you to start using PDF files (from whatever source) where that is not the case, and I quoted OpenOffice as an example, then you might run into the problem with the bug you are exploiting.

Without the PDF object number as a unique identifier, Ghostscript has only the font name to go on. If you get two different fonts with the same name (subset or otherwise), Ghostscript will assume they are the same font. If they are differently encoded (say that 'A' is encoded at position 0x42 in the first font, but 0x42 in the second font has a 'B') then Ghostscript can't tell and will simply drop the second font.

The result of this is that you will get the wrong text in the output PDF file. Again, this isn't a theoretical problem; we have had numerous bug reports on this count, which we have done our best to work around. In the end there was no alternative but to use the object number as the unique identifier. (NB we actually use the object number and the filename, in case we get two files with the same font using the same object number.)

The only way you find out this has happened is when you carefully read the text, of course.
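
As an illustration of the failure described above, here is a minimal Python sketch. It is a toy model, not Ghostscript's actual code: the dictionaries, the `dedupe` and `render` helpers, the file names and the object number 12 are all hypothetical. It only shows what happens when two same-named fonts with different encodings are collapsed by name, and why the filename is needed alongside the object number.

```python
# Toy model (assumption): a font is a name plus an encoding that maps
# character codes to glyph names. None of this is Ghostscript's real data model.

def dedupe(fonts, key):
    """Keep the first font seen for each key; later fonts with that key are dropped."""
    kept = {}
    for font in fonts:
        kept.setdefault(key(font), font)
    return kept

def render(codes, font, kept, key):
    """Decode character codes using whichever font survived deduplication."""
    encoding = kept[key(font)]["encoding"]
    return "".join(encoding[code] for code in codes)

# Two different fonts that share a name and, by coincidence, an object number.
font_a = {"file": "a.pdf", "obj": 12, "name": "Times", "encoding": {0x42: "A"}}
font_b = {"file": "b.pdf", "obj": 12, "name": "Times", "encoding": {0x42: "B"}}
fonts = [font_a, font_b]

def by_name(font):
    return font["name"]

def by_object(font):
    return font["obj"]

def by_file_and_object(font):
    return (font["file"], font["obj"])

kept_by_name = dedupe(fonts, by_name)
kept_by_object = dedupe(fonts, by_object)
kept_by_file_and_object = dedupe(fonts, by_file_and_object)

# Text from b.pdf that should read "B":
wrong = render([0x42], font_b, kept_by_name, by_name)
right = render([0x42], font_b, kept_by_file_and_object, by_file_and_object)
```

With the name as the key, the second font is dropped and the text from `b.pdf` is decoded with the first font's encoding, so the wrong glyph comes out without any error being raised. Keying on the object number alone fails in the same way here, because both files happen to use the same number.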

> We don't really have a way to forego Texinfo for our printed manuals.
> Given the comparative importance of TeX for document preparation,
> however, I think it would be good to figure out how to keep at least one
> viable way open of making this work and figure out a migration path of
> the involved tools to how you would optimally would want to have things
>
> I don't think that TeX can (or should) preserve object ids when
> including external PDF files, so figuring out some other reasonably
> robust identity associated with fonts would seem important.

Well, I know nothing about TeX. It seems to me, however, that it *must* preserve the object IDs in some sense, because otherwise you wouldn't be ending up with multiple copies of fonts. If it didn't preserve the object numbers, then it would assume that the first 'Times' is the same as the second 'Times' and would collapse them into a single reference, which is exactly what you are using Ghostscript for at present.

If your PDF files contain ToUnicode CMaps then it's possible to identify properly which glyph is actually intended by each character code in each font. Doing that would allow you to optimise the use of fonts, because you could alter the character coding of each usage so that it was consistent across the documents and only required a single instance of the font in question.
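
The re-encoding idea can be sketched roughly as follows, again in Python and again as a toy model. It assumes each font's ToUnicode CMap has already been parsed into a plain dictionary from character code to a Unicode string, and that one Unicode value corresponds to exactly one glyph, which is not true in general. The helper names and sample mappings are hypothetical.

```python
# Toy model (assumption): a parsed ToUnicode CMap is a dict mapping a
# character code to the Unicode string it represents.

def build_shared_encoding(to_unicode_maps):
    """Assign one new character code per distinct Unicode value across all fonts."""
    shared = {}
    for to_unicode in to_unicode_maps:
        for code in sorted(to_unicode):
            text = to_unicode[code]
            if text not in shared:
                shared[text] = len(shared)
    return shared

def remap(codes, to_unicode, shared):
    """Rewrite a run of character codes so that it uses the shared encoding."""
    return [shared[to_unicode[code]] for code in codes]

def decode(codes, shared):
    """Turn shared-encoding codes back into text, to check nothing changed."""
    inverse = {code: text for text, code in shared.items()}
    return "".join(inverse[code] for code in codes)

# Two fonts both called 'Times' (hypothetical), with different encodings.
doc1_to_unicode = {0x41: "A", 0x42: "B"}
doc2_to_unicode = {0x42: "A", 0x43: "C"}

shared = build_shared_encoding([doc1_to_unicode, doc2_to_unicode])

doc1_codes = remap([0x41, 0x42], doc1_to_unicode, shared)
doc2_codes = remap([0x42, 0x43], doc2_to_unicode, shared)
```

After remapping, both documents refer to 'A' by the same code, so a single font instance with the shared encoding could serve both. A real implementation would also have to merge the glyph programs themselves and cope with simple fonts being limited to 256 character codes.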

I'd have to experiment to find out, but it would not surprise me to discover that when you include a PDF file in TeX, what it actually does is convert it into an EPS or PostScript program and then concatenate all the documents together.

That would mean TeX could use PDF files as a kind of 'black box', and would mean that the fonts would be included multiple times, just as you say is happening.

>> PDF was never intended as a means of transferring, or 'containerising'
>> content, its not trivial (or even possible in general) to extract
>> content from, or simplify, PDF files.
>
> And yet I seem to remember Adobe has a specification for how to write
> PDF intended for embedding, haven't they?

Err, no, I don't think so. You can embed files untouched (including PDF files) inside a PDF, just as other file types. But that's not really what I meant when I said 'containerising'.

You can also have PDF Collections (I can't recall if that's the correct name) but again that isn't what I meant when I talk about transferring content, because you aren't transferring the content, you are including the whole thing, not just its content.

I was thinking more like writing a .docx file as an RTF or a spreadsheet as a comma-separated file. Transferring the content without the associated container.

