lilypond-devel

Re: Ghostscript/GhostPDL 9.22 Release Candidate 1


From: Ken Sharp
Subject: Re: Ghostscript/GhostPDL 9.22 Release Candidate 1
Date: Thu, 21 Sep 2017 09:53:54 +0100

At 18:50 20/09/2017 +0200, David Kastrup wrote:

> Did you get to see the PostScript files before conversion with pstopdf?
> Would being able to generate those differently make a difference?

I'm pretty sure Knut sent me everything, really everything. Not that I can use it all, but it's nice to have the complete set just in case.

The problem (for my idea) is not the generation of the individual PostScript files, or the individual PDF files. However, there is some more information on the process at the end of this mail which is (slightly) illuminating; feel free to skip ahead past this explanation.

---------------------------------------------------------------------------------
What I was hoping to do (and this works for my test cases with simpler fonts) was create the PDF files from the PostScript with only font references, no font data embedded. Then create the final PDF, still with no font data. Finally run that back through Ghostscript with the font available to it. Then the individual uses of the font would pick up the one and only font available, referenced from Ghostscript, and embed it.

That would (and does for my tests) create a final PDF file with only one instance of the font.

The problem is that supporting non-PostScript fonts from disk as replacements for PostScript fonts is tricky; it involves a certain amount of guesswork to fill in missing information. Our support for TrueType fonts isn't bad, but our support for OTF fonts (those with CFF outlines) isn't as good. Also, the nature of the font makes the guesswork rather more difficult, since it is mostly a 'symbolic' font.

So basically that won't work, at least as things stand now.
---------------------------------------------------------------------------------


> Those 125GB files, I wager, are for one-time printing or further
> compression, not for public download from a website.  So the comparison
> is not entirely fair.

Well, that one's anomalous, certainly, but we do have people passing around multi-gigabyte files for download. Also, the last game I picked up was 20GB, and that was a download only.

But, not important as I think I said.



Now, during the investigation of the files Knut sent me I did notice a few things.

From what I understand of the process, the intention is that the entire font is downloaded with each of the individual EPS files, and then the PDF file which is created should contain the entire font (I'm fairly sure someone said this). Then the individual PDF files are merged together in TeX, presumably along with some other text, producing a PDF file where there are multiple, identical, full copies of the font. You then take advantage of the Ghostscript bug to treat all the copies of the font as being the same.

I'm sorry to disappoint you, but that's not what is happening.

If the process were happening as described, then I believe mutool would be quite able to detect the duplicate font streams in the final PDF file and remove them. The reason that doesn't work is that the fonts embedded in the individual PDF files are not complete; they are subsets. Worse still, they don't have subset prefixes on the font name, so it's not even clear they are subsets.
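A minimal sketch of what "having a subset prefix" means, assuming the usual PDF convention that a subset font's name begins with a tag of six uppercase letters followed by a plus sign. The function name and the tagged example name are made up for illustration; only `/Emmentaler-20` comes from the files discussed here.

```python
import re

# PDF convention: a subset font's name starts with a tag of exactly six
# uppercase letters followed by '+', e.g. "ABCDEF+Emmentaler-20".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def has_subset_prefix(font_name: str) -> bool:
    """Return True if font_name (with or without a leading '/') carries a subset tag."""
    return SUBSET_TAG.match(font_name.lstrip("/")) is not None


# Name as it appears in the PDF files discussed in this mail.
print(has_subset_prefix("/Emmentaler-20"))
# Hypothetical tagged name, for comparison only.
print(has_subset_prefix("/ABCDEF+Emmentaler-20"))
```

A name without the tag gives a consumer no hint that the embedded font program is incomplete, which is the situation described above.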

For example, Knut sent me a bunch of EPS files and the PDF files created from them, called testa-1.eps to teste-1.eps. Looking at the EPS files I see:

%%IncludeResource: ProcSet (FontSetInit)
%%BeginResource: FontSet (Emmentaler-20)
/FontSetInit /ProcSet findresource begin
%%BeginData: 64933 Binary Bytes

The binary data that follows looks the same to me, though I haven't bothered to check precisely. All the EPS files appear to contain the same data, so I'll assume that's a complete copy of the font. Note the size, just short of 65 KB.

But, looking at the PDF files, I see quite different results.

Testa-1.pdf:

9 0 obj
<<
  /BaseFont /Emmentaler-20
  /FontDescriptor 10 0 R
  /Type /Font
  /FirstChar 7
  /LastChar 176
  /Widths [ 641 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    490 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 424 ]
  /Encoding 18 0 R
  /Subtype /Type1
>>
endobj

10 0 obj
<<
  /Type /FontDescriptor
  /FontName /Emmentaler-20
  /FontBBox [ 0 -635 645 1196 ]
  /Flags 4
  /Ascent 1196
  /CapHeight 1196
  /Descent -635
  /ItalicAngle 0
  /StemV 96
  /MissingWidth 500
  /FontFile3 17 0 R
>>
endobj

17 0 obj
<<
  /Length 9653
  /Subtype /Type1C
>>
stream


Testb-1.pdf:

10 0 obj
<<
  /BaseFont /Emmentaler-20
  /FontDescriptor 11 0 R
  /Type /Font
  /FirstChar 7
  /LastChar 176
  /Widths [ 641 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 344 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 424 ]
  /Encoding 20 0 R
  /Subtype /Type1
>>
endobj

11 0 obj
<<
  /Type /FontDescriptor
  /FontName /Emmentaler-20
  /FontBBox [ 0 -635 645 1196 ]
  /Flags 4
  /Ascent 1196
  /CapHeight 1196
  /Descent -635
  /ItalicAngle 0
  /StemV 96
  /MissingWidth 500
  /FontFile3 19 0 R
>>
endobj

19 0 obj
<<
  /Length 9708
  /Subtype /Type1C
>>
stream

As you can see, the two font streams (which have been decompressed, so there are no compression differences) are different lengths, and are both shorter than the original, *much* shorter. Also, although the FirstChar and LastChar entries in the fonts are the same, the entries in the Widths array are different.

In short, the fonts are not complete, they have been subset. And, as I noted above, even worse is the fact that the font names are not decorated with a subset prefix.
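The Widths comparison can be sketched as follows. This is only an illustration: the two arrays are shortened, hypothetical stand-ins rather than the full arrays from the files, and a zero width is taken as a hint that a code is unused, which is an assumption rather than proof (a glyph can legitimately have zero width).

```python
def nonzero_codes(first_char, widths):
    """Map character code -> width for every non-zero entry in a /Widths array."""
    return {first_char + i: w for i, w in enumerate(widths) if w != 0}


# Hypothetical, shortened arrays standing in for two font dictionaries
# that share /FirstChar 7 but were subset differently.
font_a = nonzero_codes(7, [641, 0, 490, 0, 0, 424])
font_b = nonzero_codes(7, [641, 0, 0, 344, 0, 424])

# Codes that carry a width in one font but not in the other.
only_a = sorted(set(font_a) - set(font_b))
only_b = sorted(set(font_b) - set(font_a))
print("only in A:", only_a)
print("only in B:", only_b)
```

If both lists come back empty the two dictionaries at least agree on which codes are populated; any difference suggests the embedded font programs cover different glyphs.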

Now, the lack of a prefix does mean that, in your particular case, you can take advantage of the Ghostscript bug which treats fonts with the same name as being the same.

Actually, even if we hadn't moved to using the PDF object number to uniquely identify fonts, your approach was going to have a limited lifespan. Here's the explanation, which you can skip over if you like; it's not hugely important.

---------------------------------------------------------------------------------
The reason is that this is precisely the problem we've been working towards solving for some time. Originally Ghostscript simply used the FontName as a unique identifier (because in PostScript that's how it works). If we saw two uses of the same FontName we could be sure they were the same font.

But for PDF that doesn't work. It's entirely possible to have multiple fonts with the same name in PDF, because the PDF file doesn't reference them by name, it references them by object number (it is, in fact, possible to have fonts which don't have a name at all in PDF files).

The upshot of this is that pdfwrite was seeing two different fonts, and erroneously assuming they were the same. That meant we never bothered with the second font, and so carried on with the first one. The problem is that if the two fonts had different glyphs at the same position, then the final PDF output would be wrong. We've had a number of examples of this over the years. It's never a problem with a single file input, or with input other than PDF, but if people passed multiple PDF files as input to pdfwrite, then this could occur. If need be I can dig up some of the bug reports.

Now, we knew a long time back that the way to tackle this was to use the object numbers, because in PDF these *are* unique, together with the input filename (in case two files should have fonts with the same name *and* the same object numbers). But that was always going to be a big job, so in the interim we kept on adding more heuristics to look at the properties of two fonts and decide whether they were the same or not.
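The difference between the two identification schemes can be sketched like this. It is a toy model, not Ghostscript's actual data structures: the `FontRef` record and `count_distinct` helper are invented for illustration, while the file names, object numbers and font name are the ones from the two dumps earlier in this mail.

```python
from collections import namedtuple

# Hypothetical record for a font met while reading input PDF files.
FontRef = namedtuple("FontRef", "source_file object_number font_name")


def count_distinct(fonts, key):
    """Count how many fonts are considered different under the given key."""
    return len({key(f) for f in fonts})


fonts = [
    FontRef("testa-1.pdf", 9, "Emmentaler-20"),
    FontRef("testb-1.pdf", 10, "Emmentaler-20"),
]

# Keyed by name alone, the two different subsets collapse into one font.
by_name = count_distinct(fonts, lambda f: f.font_name)
# Keyed by (input file, object number), they stay separate.
by_object = count_distinct(fonts, lambda f: (f.source_file, f.object_number))
print(by_name, by_object)
```

Under the name-only key the second font is silently replaced by the first, which is harmless only when both really contain the same glyphs.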

Sooner or later that process was going to trip you up, because we would have added a heuristic which would have identified your fonts as being different. That would have had the same effect as using the PDF object numbers does. Not only that, but we really wouldn't have had any option to restore the old behaviour, because, as I've said, it's really a bug.
---------------------------------------------------------------------------------


So, what to do.....?

Well, it occurs to me that the *real* problem here is that the fonts in the individual PDF files are subsets. If they were not, then I believe you could safely and easily use MuPDF (specifically mutool clean) to remove the duplicate fonts. Or at least, the duplicate FontFile streams; I'm not certain whether it would be possible to remove the Font and FontDescriptor objects as well. But that would certainly cover a good portion of the file size: the fonts are running at about 9 KB each, while the Font and FontDescriptor objects are a few tens of bytes.
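Why complete copies can be merged while differing subsets cannot is easy to show with a content hash. The sketch below only illustrates the idea; it is not a description of how mutool works internally, and the byte strings are hypothetical placeholders, not real font data.

```python
import hashlib
from collections import defaultdict


def group_identical_streams(streams):
    """Group stream names by the SHA-256 digest of their decoded bytes."""
    groups = defaultdict(list)
    for name, data in streams.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # Sort for a stable, readable result.
    return sorted(sorted(names) for names in groups.values())


# Hypothetical byte strings: two identical full copies and two different subsets.
full_font = b"complete font program"
streams = {
    "a": full_font,
    "b": full_font,
    "c": b"subset one",
    "d": b"subset two!",
}
print(group_identical_streams(streams))
```

Streams "a" and "b" land in the same group and could share one copy; "c" and "d" differ byte for byte, so a tool that only merges identical streams has to keep both.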

So the question then becomes 'why are the fonts subset?' That's a really good question, and the answer is that I don't know. It's possible that there is a genuine pdfwrite bug here. The piece of information I'm missing is the step used to create the PDF files from the EPS files; I don't know how you are doing that.

My attempts to replicate the individual PDF files have been entirely unsuccessful. I get files with three copies of the Emmentaler font embedded instead of one, and none of the three fonts match the ones in the PDF files Knut supplied.


Hmm, actually, going back to the 9.21 release does produce at least similar behaviour, whereas the 9.22 release does not. In 9.22 I get three fonts output instead of 1. I've no idea why currently, and right at the moment I don't have time to look.

I'll try and remember to look at it when I am not drowning under support, but it looks like there have been changes in this area unrelated to the PDFDontUseObjectNum bug, and that in itself may mean that your process doesn't work any more, or works less well.



                        Ken



