Re: [Groff] PDFPIC macro
From: Keith Marshall
Subject: Re: [Groff] PDFPIC macro
Date: Wed, 11 Oct 2017 10:09:24 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0
On 09/10/17 23:44, Deri James wrote:
> On Mon 09 Oct 2017 09:10:18 Keith Marshall wrote:
>> Perhaps, you could:
>>
>> $ make clean
>> $ make CFLAGS=-DDEBUGGING
>>
>> and check your failing PDFs again, so we can see whatever
>> unexpected token sequence is leading to the "syntax error"; only
>> when we know that, will we have any chance of handling it, before
>> the parser simply gives up on the offending PDF.
>
> Thought I'd better take this off list (it's a bit too "techy"
> perhaps), hope you don't mind.
Actually, I do mind ... and I completely disagree with your reasoning.
Certainly, some list members -- perhaps even a majority -- will not be
interested in the technical details, but there will surely be some who
may be interested, and who may even contribute constructively. Your
arbitrary decision to communicate privately denies *all* list members
the freedom to choose whether they wish to participate, or not, and it
denies *me* potential benefit from Eric Raymond's "many eyes make bugs
shallow" principle.
I will not publish your sample files, without your permission, but
otherwise, this belongs on the list, so I'm taking it back there.
> I ran psbb against the errant pdf with the lex debugging turned on
> and got this:-
>
> address@hidden groff-psbb]$ ./psbb ../pdf
> 20: 18 0 R
> 17: return token PDFROOT (259)
> 17: return token VALUE (260)
> 17: return token VALUE (260)
> 10: return token 'R' (82)
> 20: 19
> 11: return token PDFOBJREF (263)
> 12: pdfseek to offset = 305005
> 13: return token VALUE (260)
> 13: return token VALUE (260)
> 13: pdfseek to offset = 305035
> 14: lookup object #1 @ 305015 within 0..19
> 14: 0000002355 00000 n --> 2355; 0 n
> 14: pdfseek to offset = 2355
> 15: return token VALUE (260)
> 15: return token VALUE (260)
> 16: return token PDFOBJECT (262)
> 16: object: 1; generation = 0
> 17: return token VALUE (260)
> 17: return token VALUE (260)
> 10: return token 'R' (82)
> psbb:t-psbb (t-psbb.cpp):193: syntax error
>
> Now I believe it located the xref section and then found the /Catalog
> (at offset 2355) but does not like something in it.
Right. After seeking to offset 2355, in state 14 (PDFGOXREF), the lexer
switches to state 15 (PDFGETOBJECT), where it reads the signature of the
object at that offset, then in state 16 (PDFSCANOBJECT), it checks that
it has actually found the object it expected (1 0 obj, in this case),
and proceeds to scan the object content. As it does so, it will find a
dictionary which, in the case of this /Catalog object, is expected to
include at least a "/Type /Catalog" entry and a "/Pages n n R" entry.
In the case of your PDF, it looks like:
1 0 obj << /Pages 2 0 R
/Type /Catalog
>>
endobj
From state 16, the lexer passes through state 10 (PDFDICT), switching
to state 17 (PDFREFER) as soon as it encounters the /Pages key, whence
it returns a pair of VALUE tokens to the yacc stack (which, prior to
this, had been empty); control then reverts to state 10, whence the 'R'
token is returned, to complete the indirect reference for the /Pages
object. At this point, yacc throws the "syntax error", because there
is no rule in its grammar to handle the token sequence:
VALUE VALUE 'R'
Had the "/Type /Catalog" entry preceded the "/Pages 2 0 R" entry, within
the /Catalog object dictionary, then it would have caused the lexer to
return a PDFOBJREF token, *before* the /Pages object reference, yielding
a yacc stack state of:
PDFOBJREF VALUE VALUE 'R'
for which a grammar rule has been specified, so the lexer would have
successfully followed the object reference. However, there is nothing
in the PDF specifications to require the /Type entry to precede the
/Pages entry, so we need an equivalent postfix rule, to accommodate:
VALUE VALUE 'R' PDFOBJREF
Adding such a rule is sufficient to fix the issue, for all of your
sample PDF files, with two exceptions (see below).
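As a rough illustration of the two token orders described above, the
following Python sketch models a rule set as plain token tuples. It is a
toy model only, not psbb's actual yacc grammar; the names PREFIX, POSTFIX
and accepts() are hypothetical, and a real LR parser reports the error
token by token rather than matching whole sequences.

```python
# Toy model of the two token orders; NOT psbb's real yacc grammar.
PREFIX = ("PDFOBJREF", "VALUE", "VALUE", "R")   # /Type precedes /Pages
POSTFIX = ("VALUE", "VALUE", "R", "PDFOBJREF")  # /Pages precedes /Type


def accepts(tokens, rules):
    """Return True if the whole token sequence matches one of the rules."""
    return tuple(tokens) in rules


# Token order produced for the ghostscript and gropdf /Catalog layouts.
gs_order = ["PDFOBJREF", "VALUE", "VALUE", "R"]
gropdf_order = ["VALUE", "VALUE", "R", "PDFOBJREF"]

original_rules = {PREFIX}            # only the prefix form is known
extended_rules = {PREFIX, POSTFIX}   # prefix form plus the postfix form

print(accepts(gs_order, original_rules))
print(accepts(gropdf_order, original_rules))
print(accepts(gropdf_order, extended_rules))
```

With only the prefix rule, the gropdf ordering is rejected; adding the
postfix form lets both orderings through.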
> Unfortunately, my lexer foo is waning, well to be honest it never
> existed!!
>
> The attached archive holds some samples of two types of pdfs, either
> produced by gropdf or produced by cairo software. Inside the two
> subdirectories there are three types of files:-
>
> *-structure.pdf (these illustrate the structure of the pdf with
> similar name)
>
> *.pdf (these are the pdfs to run against psbb)
>
> *.mm (a program called "freemind" can open these files, they also
> illustrate the structure, but you can interactively click to
> open/close object nodes).
>
> In the gropdf directory the gropdf.pdf file is the one having
> problems, and the gs.pdf is the same file after running through
> ghostscript, which psbb handles perfectly. Both files load fine in
> acroread, which can be quite picky when it comes to syntax, although
> the probability is that gropdf is not quite standard enough.
The gropdf.pdf file has the /Catalog object structure I've illustrated
above. I guess passing it through ghostscript reversed the order of the
/Type and /Pages dictionary entries; inspection reveals it to be thus:
1 0 obj
<</Type /Catalog /Pages 3 0 R
/Metadata 23 0 R
>>
endobj
This would have worked anyway, with the original psbb grammar; adding
the additional postfix PDFOBJREF rule makes it work just as well for the
inverted order in the gropdf.pdf /Catalog dictionary. (This inverted
order may, perhaps, seem less logical, but it doesn't violate the PDF
standard, so we do need to accommodate it).
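To make the order independence concrete, here is a minimal Python sketch
which pulls the /Pages reference out of both /Catalog objects shown
above. It uses a regular expression, which is a deliberately different
(and much cruder) technique than psbb's lex/yacc scanner; the names
PAGES_REF and pages_reference() are hypothetical.

```python
import re

# Matches "/Pages <object> <generation> R" anywhere in a dictionary body.
PAGES_REF = re.compile(rb"/Pages\s+(\d+)\s+(\d+)\s+R\b")


def pages_reference(catalog):
    """Return (object number, generation) for /Pages, or None if absent."""
    match = PAGES_REF.search(catalog)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The two /Catalog objects quoted in this message.
gropdf_catalog = b"1 0 obj << /Pages 2 0 R\n/Type /Catalog\n>>\nendobj\n"
gs_catalog = (
    b"1 0 obj\n<</Type /Catalog /Pages 3 0 R\n/Metadata 23 0 R\n>>\nendobj\n"
)

print(pages_reference(gropdf_catalog))
print(pages_reference(gs_catalog))
```

Either key order yields the reference, which is the behaviour the
grammar needs to reproduce.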
> The cairo directory contains two examples created by inkscape, psbb
> has a big problem with these.
The yacc grammar adjustment also fixes all but two of these: SJP.pdf
and SJP-Whole.pdf seem to confuse psbb, such that, having followed object
references through the /Catalog and /Pages objects, it gets into an
infinite loop, rescanning the first /Page object ad infinitum. It
appears to be confused by an embedded /Group dictionary, which places
the lexer in a state in which it overruns the "endobj" sentinel, and
reads ahead until it discovers the /Kids reference in the /Pages object
(which actually appears later in the file than the /Page object to which
it refers), and follows that reference back to the /Page object again
(and again, and again ...). I have an idea how to fix this too ...
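The fix itself is not described here. Purely to illustrate the failure
mode, the Python sketch below shows one generic defence against this
kind of loop: remembering which objects have already been visited. This
is an assumption for illustration, not the fix intended for psbb; the
function follow_references() and the object numbers are hypothetical.

```python
def follow_references(start, references):
    """Walk object references from start, refusing to revisit an object.

    references maps an object number to the object number followed next,
    or to None where the walk should end.
    """
    visited = []
    current = start
    while current is not None:
        if current in visited:
            # Without this check, the walk below would never terminate.
            raise ValueError(f"reference cycle at object {current}")
        visited.append(current)
        current = references.get(current)
    return visited


# Well-behaved file: /Catalog (1) -> /Pages (2) -> /Page (3), then stop.
print(follow_references(1, {1: 2, 2: 3, 3: None}))

# Overrun case: scanning /Page (3) leads straight back to /Page (3).
try:
    follow_references(1, {1: 2, 2: 3, 3: 3})
except ValueError as error:
    print(error)
```

A guard like this only turns the endless rescan into a diagnosable
error; stopping the scan at the "endobj" sentinel would still be needed
to find the correct bounding box.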
> I hope these are helpful to you, sorry for being a nuisance.
> Integrating pdf bounding boxes into groff would be a big benefit.
> These are the MediaBoxes which would be expected.
Thanks. These are helpful to me, (but obviously not to others, unless
you're willing to distribute them). Regardless, I'll leave the analysis
here, for reference.
> address@hidden Samples]$ pdfbb Cairo/*.pdf gropdf/*.pdf
> Processing 'Cairo/gropdf-pdf-structure.pdf'
> Cairo/gropdf-pdf-structure.pdf: MediaBox: 0,0,842,595
> Processing 'Cairo/gs-pdf-structure.pdf'
> Cairo/gs-pdf-structure.pdf: MediaBox: 0,0,842,595
> Processing 'Cairo/SJP.pdf'
> Cairo/SJP.pdf: MediaBox: 0,0,114.146561,115.235786
> Processing 'Cairo/SJP-structure.pdf'
> Cairo/SJP-structure.pdf: MediaBox: 0,0,842,595
> Processing 'Cairo/SJP-Whole.pdf'
> Cairo/SJP-Whole.pdf: MediaBox: 0,0,210.231384,138.239899
> Processing 'Cairo/SJP-Whole-structure.pdf'
> Cairo/SJP-Whole-structure.pdf: MediaBox: 0,0,842,595
> Processing 'gropdf/gropdf.pdf'
> gropdf/gropdf.pdf: MediaBox: 0,0,612,792
> Processing 'gropdf/gropdf-pdf-structure.pdf'
> gropdf/gropdf-pdf-structure.pdf: MediaBox: 0,0,842,595
> Processing 'gropdf/gs.pdf'
> gropdf/gs.pdf: MediaBox: 0,0,612,792
> Processing 'gropdf/gs-pdf-structure.pdf'
> gropdf/gs-pdf-structure.pdf: MediaBox: 0,0,842,595
--
Regards,
Keith.