pdf-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [pdf-devel] Bug in LZW filter?


From: Aleksander Morgado
Subject: Re: [pdf-devel] Bug in LZW filter?
Date: Tue, 20 Sep 2011 10:30:12 +0200

Hey Georg,

> >> Additional tests showed that encoding with EarlyChange = 0 (very
> >> unusual) needs a bigger table (LZW_MAX_DICTSIZE + 1). I have done a lot
> >> of testing and the LZW encoder (with EarlyChange = 0) now outputs the
> >> same files as PDFlib-Lite-5.0.4p1 does. (PDFlib-Lite-5.0.4p1 is the only
> >> encoder with EarlyChange = 0 I found)
> >>
> > 
> > So the changes done will make it work both with EarlyChange=0 and
> > EarlyChange=1, I am assuming.
> 
> Yes.
> 

Great.

> > Could you maybe try to explain one by one the changes in
> > 'src/base/pdf-stm-f-lzw.c'? They seem pretty straightforward, but I
> > would like to know the reasoning behind each of them. Are they all due
> > to needing a bigger table with EarlyChange=0?
> 
> First of all: the standard is very vague. It says:
> "[EarlyChange:] An indication of when to increase the code length. If
> the value of this entry is 0, code length increases shall be postponed
> as long as possible. If the value is 1, code length increases shall
> occur one code early."
> 

I think the Wikipedia page for LZW explains it quite well, check the
"Variable-width codes" section in:

http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch


> As I could not figure out which code is meant (input, output, or
> dictionary) this remains unclear to me. 

The code which is increased is the generated output during encoding; or
the read input during decoding. When encoding, the LZW filter will start
using N bits in the generated code; until the moment where it finds that
it needs to generate N+1 bits: when 'ω+s' is not found in the
dictionary. Once this happens, the encoder can either:
 * Emit 'ω' with width N and then increase the size of the code to be
generated in one bit (standard).
 * Emit 'ω' with width N+1 directly (EarlyChange)

The decoder needs to know that information while trying to decode the
bytestream, or it won't be able to match properly the boundaries of the
generated codes.

> So I used showpdf(mupdf) to get
> (real world) examples and compared the decoded bytes with vbindiff and
> decoded examples by hand. It shows that the lzw_buffer_inc_bitsize comes
> to early [1].
> 
> This fix broke my dec(enc(rand.bin) == rand.bin "test". So I used
> showpdf examples again ... It shows that the reset code is encoded two
> codes to early[2].
> 
> I have done this procedure for EarlyChange=0 as well and I found out
> that the dictionary is to small (after looking at the source code of
> pdflib5) [3].
> 
> At [4],[5] I increased the numbers because of the increased DICTSIZE (to
> get the same result) But I am not 100% sure about this. Someone with
> experience in LZW maybe double-check this.
> 

I will try to get deep on the process and check all this reasoning. That
+1 in the DICTSIZE really seemed to be just a convenience.

> [6] is irrelevant because it belongs to an further bug I will commit
> later. Sorry for that. I will remove it.
> 

Ok, thanks for noticing. I will remove that chunk myself from the branch
I'm reviewing.

> > Are we testing these fixes with more than one dataset? 
> The unit test is with two test sets (that has at least one dict reset).
> One with EarlyChange == 0, one with EarlyChange == 1;
> 
> But I tested LZW decoding with several files by following creators:
> * Acrobat PDFWriter 2.01 for Windows
> * Acrobat PDFWriter 3.0 for Windows
> * Acrobat 3.0 Import Plug-in
> * Acrobat Distiller 2.0 for Windows
> * GPL Ghostscript 8.71
> 
> For encoding see next answer.
> 
> 
> How sure are we
> > that we're not fixing one test case and breaking all the others?
> 
> Good question. As the standard remains unclear (to me) I only can do
> testing and looking at other open projects source code. Without my fix I
> am able to show you dozens of bad decoded PDFs. With my fix I cannot.
> 
> With the encoder it is much harder. Because the decoder listens to the
> reset code you can reset to early and the decoded result is correct
> anyway. My tests showed that in fact there are different LZW Encoder
> implementations (for example (with EarlyChange = 1): GNU PDF is binary
> same as Ghostscript 8.71 but "Acrobat Distiller 2.0" or "Acrobat
> PDFWriter 2.01" places the reset code later)
> 
> I think my tests for the encoder are quite good. I tested (with
> EarlyChange = 1) 3 different PDFs created with Ghostscript 8.71 with a
> 48KB to 50KB LZW-Stream. The decoded bytes (with GNU PDF or pdfshow)
> encode to the origin bytes.
> 

For the encoder side, we will possibly always skip EarlyChange, which is
the standard way of doing it.


Cheers,

-- 
Aleksander

Attachment: signature.asc
Description: This is a digitally signed message part


reply via email to

[Prev in Thread] Current Thread [Next in Thread]