[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [pdf-devel] Bug in LZW filter?
From: |
Aleksander Morgado |
Subject: |
Re: [pdf-devel] Bug in LZW filter? |
Date: |
Tue, 20 Sep 2011 10:30:12 +0200 |
Hey Georg,
> >> Additional tests showed that encoding with EarlyChange = 0 (very
> >> unusual) needs a bigger table (LZW_MAX_DICTSIZE + 1). I have done a lot
> >> of testing and the LZW encoder (with EarlyChange = 0) now outputs the
> >> same files as PDFlib-Lite-5.0.4p1 does. (PDFlib-Lite-5.0.4p1 is the only
> >> encoder with EarlyChange = 0 I found)
> >>
> >
> > So the changes done will make it work both with EarlyChange=0 and
> > EarlyChange=1, I am assuming.
>
> Yes.
>
Great.
> > Could you maybe try to explain one by one the changes in
> > 'src/base/pdf-stm-f-lzw.c'? They seem pretty straightforward, but I
> > would like to know the reasoning behind each of them. Are they all due
> > to needing a bigger table with EarlyChange=0?
>
> First of all: the standard is very vague. It says:
> "[EarlyChange:] An indication of when to increase the code length. If
> the value of this entry is 0, code length increases shall be postponed
> as long as possible. If the value is 1, code length increases shall
> occur one code early."
>
I think the Wikipedia page for LZW explains it quite well, check the
"Variable-width codes" section in:
http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
> As I could not figure out which code is meant (input, output, or
> dictionary) this remains unclear to me.
The code which is increased is the generated output during encoding; or
the read input during decoding. When encoding, the LZW filter will start
using N bits in the generated code; until the moment where it finds that
it needs to generate N+1 bits: when 'ω+s' is not found in the
dictionary. Once this happens, the encoder can either:
* Emit 'ω' with width N and then increase the size of the code to be
generated in one bit (standard).
* Emit 'ω' with width N+1 directly (EarlyChange)
The decoder needs to know that information while trying to decode the
bytestream, or it won't be able to match properly the boundaries of the
generated codes.
> So I used showpdf(mupdf) to get
> (real world) examples and compared the decoded bytes with vbindiff and
> decoded examples by hand. It shows that the lzw_buffer_inc_bitsize comes
> to early [1].
>
> This fix broke my dec(enc(rand.bin) == rand.bin "test". So I used
> showpdf examples again ... It shows that the reset code is encoded two
> codes to early[2].
>
> I have done this procedure for EarlyChange=0 as well and I found out
> that the dictionary is to small (after looking at the source code of
> pdflib5) [3].
>
> At [4],[5] I increased the numbers because of the increased DICTSIZE (to
> get the same result) But I am not 100% sure about this. Someone with
> experience in LZW maybe double-check this.
>
I will try to get deep on the process and check all this reasoning. That
+1 in the DICTSIZE really seemed to be just a convenience.
> [6] is irrelevant because it belongs to an further bug I will commit
> later. Sorry for that. I will remove it.
>
Ok, thanks for noticing. I will remove that chunk myself from the branch
I'm reviewing.
> > Are we testing these fixes with more than one dataset?
> The unit test is with two test sets (that has at least one dict reset).
> One with EarlyChange == 0, one with EarlyChange == 1;
>
> But I tested LZW decoding with several files by following creators:
> * Acrobat PDFWriter 2.01 for Windows
> * Acrobat PDFWriter 3.0 for Windows
> * Acrobat 3.0 Import Plug-in
> * Acrobat Distiller 2.0 for Windows
> * GPL Ghostscript 8.71
>
> For encoding see next answer.
>
>
> How sure are we
> > that we're not fixing one test case and breaking all the others?
>
> Good question. As the standard remains unclear (to me) I only can do
> testing and looking at other open projects source code. Without my fix I
> am able to show you dozens of bad decoded PDFs. With my fix I cannot.
>
> With the encoder it is much harder. Because the decoder listens to the
> reset code you can reset to early and the decoded result is correct
> anyway. My tests showed that in fact there are different LZW Encoder
> implementations (for example (with EarlyChange = 1): GNU PDF is binary
> same as Ghostscript 8.71 but "Acrobat Distiller 2.0" or "Acrobat
> PDFWriter 2.01" places the reset code later)
>
> I think my tests for the encoder are quite good. I tested (with
> EarlyChange = 1) 3 different PDFs created with Ghostscript 8.71 with a
> 48KB to 50KB LZW-Stream. The decoded bytes (with GNU PDF or pdfshow)
> encode to the origin bytes.
>
For the encoder side, we will possibly always skip EarlyChange, which is
the standard way of doing it.
Cheers,
--
Aleksander
signature.asc
Description: This is a digitally signed message part