Re: Ghostscript/GhostPDL 9.22 Release Candidate 1

lilypond-devel


From:	Ken Sharp
Subject:	Re: Ghostscript/GhostPDL 9.22 Release Candidate 1
Date:	Thu, 21 Sep 2017 09:53:54 +0100

At 18:50 20/09/2017 +0200, David Kastrup wrote:

Did you get to see the PostScript files before conversion with pstopdf?
Would being able to generate those differently make a difference?

I'm pretty sure Knut sent me everything, really everything. Not that I can use it all, but it's nice to have the complete set just in case.

The problem (for my idea) is not the generation of the individual PostScript files, or the individual PDF files. However, there is some more information on the process at the end of this mail which is (slightly) illuminating, feel free to skip ahead past this explanation.


---------------------------------------------------------------------------------

What I was hoping to do (and this works for my test cases with simpler fonts) was create the PDF files from the PostScript with only font references, no font data embedded. Then create the final PDF, still with no font data. Finally run that back through Ghostscript with the font available to it. Then the individual uses of the font would pick up the one and only font available, referenced from Ghostscript, and embed it.

That would (and does for my tests) create a final PDF file with only one instance of the font.

The problem is that supporting non-PostScript fonts from disk as replacements for PostScript fonts is tricky; it involves a certain amount of guesswork to fill in missing information. Our support for TrueType fonts isn't bad, but our support for OTF fonts (those with CFF outlines) isn't as good. Also, the nature of the font makes the guesswork rather more difficult, since it is mostly a 'symbolic' font.


So basically that won't work, at least as things stand now.
---------------------------------------------------------------------------------

Those 125GB files, I wager, are for one-time printing or further
compression, not for public download from a website.  So the comparison
is not entirely fair.

Well that one's anomalous, certainly, but we do have people passing around multi-gigabyte files for download. Also, the last game I picked up was 20GB, and that was a download only.


But, not important as I think I said.

Now, during the investigation of the files Knut sent me I did notice a few things.

From what I understand of the process, the intention is that the entire font is downloaded with each of the individual EPS files, and then the PDF file which is created should contain the entire font (I'm fairly sure someone said this). Then the individual PDF files are merged together in TeX, presumably along with some other text, producing a PDF file where there are multiple, identical, full copies of the font. You then take advantage of the Ghostscript bug to treat all the copies of the font as being the same.


I'm sorry to disappoint you, but that's not what is happening.

If the process were happening as described, then I believe mutool would be quite able to detect the duplicate font streams in the final PDF file and remove them. The reason that doesn't work is because the fonts embedded in the individual PDF files are not complete, they are subsets. Worse still, they don't have subset prefixes on the font name, so it's not even clear they are subsets.

For example, Knut sent me a bunch of EPS files and the PDF files created from them, called testa-1.eps to teste-1.eps. Looking at the EPS files I see:


%%IncludeResource: ProcSet (FontSetInit)
%%BeginResource: FontSet (Emmentaler-20)
/FontSetInit /ProcSet findresource begin
%%BeginData: 64933 Binary Bytes

The following binary looks the same to me, I haven't bothered to check precisely. All the EPS files appear to contain the same data. So I'll assume that's a complete copy of the font. Note the size, just short of 65 KB.


But, looking at the PDF files, I see quite different results.

Testa-1.pdf:

9 0 obj
<<
  /BaseFont /Emmentaler-20
  /FontDescriptor 10 0 R
  /Type /Font
  /FirstChar 7
  /LastChar 176
  /Widths [ 641 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    490 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 424 ]
  /Encoding 18 0 R
  /Subtype /Type1
>>
endobj

10 0 obj
<<
  /Type /FontDescriptor
  /FontName /Emmentaler-20
  /FontBBox [ 0 -635 645 1196 ]
  /Flags 4
  /Ascent 1196
  /CapHeight 1196
  /Descent -635
  /ItalicAngle 0
  /StemV 96
  /MissingWidth 500
  /FontFile3 17 0 R
>>
endobj

17 0 obj
<<
  /Length 9653
  /Subtype /Type1C
>>
stream


Testb-1.pdf:

10 0 obj
<<
  /BaseFont /Emmentaler-20
  /FontDescriptor 11 0 R
  /Type /Font
  /FirstChar 7
  /LastChar 176
  /Widths [ 641 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 344 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 424 ]
  /Encoding 20 0 R
  /Subtype /Type1
>>
endobj
11 0 obj
<<
  /Type /FontDescriptor
  /FontName /Emmentaler-20
  /FontBBox [ 0 -635 645 1196 ]
  /Flags 4
  /Ascent 1196
  /CapHeight 1196
  /Descent -635
  /ItalicAngle 0
  /StemV 96
  /MissingWidth 500
  /FontFile3 19 0 R
>>

endobj
19 0 obj
<<
  /Length 9708
  /Subtype /Type1C
>>
stream

As you can see the two font streams (which have been decompressed, so there's no compression differences) are different lengths, and are both shorter than the original, *much* shorter. Also, although the FirstChar and LastChar entries in the fonts are the same, the entries in the Widths array are different.
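To make that comparison concrete, here is a rough Python sketch. It assumes the /Widths entries have already been read out of each Font object, and that a non-zero width marks a glyph that was kept in the subset. The short arrays are made-up stand-ins with the same shape as the dumps above (the real ones run from FirstChar 7 to LastChar 176); only the three byte counts are taken from the files themselves. The function names are illustrative, not part of any tool.

```python
def used_codes(first_char, widths):
    """Map character code -> width for every non-zero entry of a /Widths array."""
    return {first_char + i: w for i, w in enumerate(widths) if w != 0}


def compare_subsets(first_char, widths_a, widths_b):
    """Return (codes used by both, codes only in A, codes only in B)."""
    a = used_codes(first_char, widths_a)
    b = used_codes(first_char, widths_b)
    return (sorted(a.keys() & b.keys()),
            sorted(a.keys() - b.keys()),
            sorted(b.keys() - a.keys()))


# Shortened, made-up arrays standing in for the two /Widths dumps.
widths_a = [641, 0, 490, 0, 0, 424]
widths_b = [641, 0, 0, 344, 0, 424]

shared, only_a, only_b = compare_subsets(7, widths_a, widths_b)
print("shared:", shared)
print("only in A:", only_a)
print("only in B:", only_b)

# Sizes quoted in this mail: %%BeginData in the EPS, /Length of each FontFile3.
full_font_bytes = 64933
subset_bytes = {"Testa-1.pdf": 9653, "Testb-1.pdf": 9708}
for name, size in subset_bytes.items():
    print(f"{name}: {size / full_font_bytes:.1%} of the EPS font data")
```

Any code that appears in only one of the two lists is a glyph present in one embedded font and missing from the other, which is what rules out treating the two streams as copies of the same font.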

In short, the fonts are not complete, they have been subset. And, as I noted above, even worse is the fact that the font names are not decorated with a subset prefix.
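For reference, the PDF convention for a subset font is a tag of six uppercase letters and a plus sign in front of the real name. A minimal Python check for that tag might look like the sketch below; the function name and the tag `ABCDEF` are made up for illustration, and the name is assumed to have been extracted already, without its leading slash.

```python
import re

# Subset tag convention: six uppercase letters, "+", then the real font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when there is no subset prefix."""
    match = SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


# The name as it appears in the PDF files discussed here: no tag at all.
print(split_subset_tag("Emmentaler-20"))

# What a tagged subset would look like (the tag itself is hypothetical).
print(split_subset_tag("ABCDEF+Emmentaler-20"))
```

A consumer that relies on the tag to spot subsets would take the untagged name at face value and assume a complete font.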

Now, the lack of a prefix does mean that, in your particular case, you can take advantage of the Ghostscript bug which treats fonts with the same name as being the same.

Actually, even if we hadn't moved to using the PDF object number to uniquely identify fonts, your approach was going to have a limited lifespan. Here's the explanation, which you can skip over if you like, it's not hugely important.


---------------------------------------------------------------------------------

The reason is that this is precisely the problem we've been working towards solving for some time. Originally Ghostscript simply used the FontName as a unique identifier (because in PostScript that's how it works). If we saw two uses of the same FontName we could be sure they were the same font.

But for PDF that doesn't work. It's entirely possible to have multiple fonts with the same name in PDF, because the PDF file doesn't reference them by name, it references them by object number (it is, in fact, possible to have fonts which don't have a name at all in PDF files).

The upshot of this is that pdfwrite was seeing two different fonts, and erroneously assuming they were the same. That meant we never bothered with the second font, and so carried on with the first one. The problem is that if the two fonts had different glyphs at the same position, then the final PDF output would be wrong. We've had a number of examples of this over the years. It's never a problem with a single file input, or with input other than PDF, but if people passed multiple PDF files as input to pdfwrite, then this could occur. If needs be I can dig up some of the bug reports.

Now we knew a long time back that the way to tackle this was to use the object numbers, because in PDF these *are* unique, and the input filename (in case two files should have fonts with the same name *and* the same object numbers), but that was always going to be a big job, so in the interim we kept on adding more heuristics to look at the properties of two fonts and decide whether they were the same or not.
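The difference between the two identification schemes can be shown with a toy Python model. This is only an illustration of the idea, not how pdfwrite is implemented: the `Font` record and `group_by` helper are hypothetical, the object numbers are those of the Font objects in the two dumps earlier in this mail, and the glyph codes other than 7 and 176 are invented.

```python
from collections import namedtuple

# Hypothetical record of a font as seen in an input PDF; not a Ghostscript type.
Font = namedtuple("Font", "source_file object_number font_name glyph_codes")

fonts = [
    Font("Testa-1.pdf", 9, "Emmentaler-20", frozenset({7, 60, 176})),
    Font("Testb-1.pdf", 10, "Emmentaler-20", frozenset({7, 84, 176})),
]


def group_by(fonts, key):
    """Group fonts under whatever identity the key function assigns them."""
    groups = {}
    for font in fonts:
        groups.setdefault(key(font), []).append(font)
    return groups


# Old scheme: the FontName alone is the identity, so both subsets collapse.
by_name = group_by(fonts, lambda f: f.font_name)

# New scheme: input filename plus object number keeps them apart.
by_object = group_by(fonts, lambda f: (f.source_file, f.object_number))

print(len(by_name), "font(s) when keyed by name")
print(len(by_object), "font(s) when keyed by file and object number")

# Under the name-only scheme only the first font survives, so any glyph that
# exists only in the second one is lost from the output.
first, second = by_name["Emmentaler-20"]
print("codes missing from the surviving font:",
      sorted(second.glyph_codes - first.glyph_codes))
```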

Sooner or later that process was going to trip you up, because we would have added a heuristic which would have identified your fonts as being different. Which would have had the same effect as using the PDF object numbers does. Not only that, but we really wouldn't have had any option to restore the old behaviour, because, as I've said, it's really a bug.

---------------------------------------------------------------------------------


So, what to do.....?

Well, it occurs to me that the *real* problem here is that the fonts in the individual PDF files are subsets. If they were not, then I believe you could safely and easily use MuPDF (specifically mutool clean) to remove the duplicate fonts. Or at least, the duplicate FontFile streams, I'm not certain if the Font and FontDescriptor objects would be possible to remove as well. But that would certainly cover a good portion of the file size, the fonts are running at about 9 KB each, while the Font and FontDescriptor objects are a few tens of bytes.
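Why complete copies matter for that kind of clean-up can be sketched in Python. This is not MuPDF's actual algorithm, just an assumed model in which streams are merged only when their bytes are identical; the object ids and stream contents are placeholders.

```python
import hashlib


def dedupe_streams(streams):
    """Map each object id to the first object id holding identical bytes.

    `streams` is a dict of object id -> stream bytes. An object that maps to
    itself is kept; any other object could be dropped and its references
    redirected to the object it maps to.
    """
    first_seen = {}
    remap = {}
    for obj_id in sorted(streams):
        digest = hashlib.sha256(streams[obj_id]).digest()
        remap[obj_id] = first_seen.setdefault(digest, obj_id)
    return remap


# Three byte-identical complete fonts: everything collapses onto object 17.
full_font = b"complete font program"
print(dedupe_streams({17: full_font, 19: full_font, 23: full_font}))

# Two different subsets of the same font: nothing can be merged.
print(dedupe_streams({17: b"subset with glyphs A", 19: b"subset with glyphs B"}))
```

With complete fonts every copy hashes the same and only one needs to stay in the file. With subsets each stream differs, so a byte-level comparison finds nothing to remove.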

So the question then becomes 'why are the fonts subset?' That's a really good question, and the answer is that I don't know. It's possible that there is a genuine pdfwrite bug here. The piece of information I'm missing is the step used to create the PDF files from the EPS files, I don't know how you are doing that.

My attempts to replicate the individual PDF files have been entirely unsuccessful, I get files with three copies of the Emmentaler font embedded instead of 1, and none of the three fonts match the ones in the PDF files Knut supplied.

Hmm, actually, going back to the 9.21 release does produce at least similar behaviour, whereas the 9.22 release does not. In 9.22 I get three fonts output instead of 1. I've no idea why currently, and right at the moment I don't have time to look.

I'll try and remember to look at it when I am not drowning under support, but it looks like there have been changes in this area unrelated to the PDFDontUseObjectNum bug, and that in itself may mean that your process doesn't work any more, or works less well.

Ken
