[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: java.util.zip speedup
From: |
Artur Biesiadowski |
Subject: |
Re: java.util.zip speedup |
Date: |
Mon, 23 Dec 2002 15:41:18 +0100 |
User-agent: |
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.3a) Gecko/20021212 |
John Leuner wrote:
Did you do any profiling of the code?
Yes, many times, but only under sun jre 1.4.1 client hotspot on windows.
Both cpu and memory. For the cpu, I do not think there are any suprises
- most of time is spend in Inflater.decodeHuffman, with
InflaterHuffmanTree.getSymbol contributing the most there (most methods
called from decodeHuffman get inlined probably). Next parts are
Adler32.update and OutputWindow.repeat (last one can be an artifact due
to System.arraycopy happening there, providing good point for gc break).
I have done small optimalizations in few places, mostly in zip entry
reading (it is now about 5 times faster) - but not in actual
decompression routines.
That's quite an improvement. Are you saying that the current version
generates 130MB of garbage, and yours creates 1MB?
Yes. This 1MB is not real 'garbage' - most of it are ZipEntry objects,
which are actually used by test program.
As for 130MB of garbage, most of it comes from OutputWindow allocation.
I'm reading over 3000 files from zip, with 32kb byte[] allocated per
file, it is already around 90MB. Next part comes from
InflaterInputStream (4kb buffer), and quite a few from
InflaterHuffmanTree, where temporary arrays are created each time. There
is also an incredible amount of small arrays created in ZipEntry
creation (for each read, small 2 or 4 byte array is created) - does not
influence byte-count of garbage much, but it means that around 10 arrays
are allocated per entry.
I have done improvements in two major ways.
First important one is changing way zip entries are read - it used to
read header pieces word after word, I read entry at once - _very_ big
improvement at first zip read. I have also removed most of temporary
buffer allocations there.
Second is caching Inflater. Unfortunately, I have had to hide this
caching from world very deep, so there is no chance of somebody messing
with cached Inflater in any way. This basically means that only
InputStream created by ZipFile will gain this benefit - with
ZipInputStream, GZIPInputStream and generic InflaterInputStream we have
almost same situation as before.
Inflater in turn cache OutputWindow and HuffmanTrees, which cache their
temporary arrays. In reset methods I'm doing quite extensive cleaning -
somebody with more zip experience probably will be able to remove some
of these precautions (I was not sure if at some place there is not
dependency on not-touched array element to be zero, so I clear almost
all of them, except obvious cases).
Cache works ok with multiple threads - just if you have more than one of
InputStreams open at same time, second and following ones will produce
garbage (I think that in majority of cases single thread is doing most
of reading from zip).
Could you send me a diff of your changes?
Sure, I'm sending them to you in separate mail. Please note that work is
not finished - no documentation is done, some in fact is outdated. I
would probably also want to move factory methods to separate class.
Bearing in mind that the java.util.zip implementation in Classpath is
also available as a seperate library, we should consider the needs of
people who might want to use this in an embedded system where space is
at a premium.
Three basic choices here.
1) In current state, there is zero problem with making it configurable.
Everything boils down to two 'factory' free methods, which needs to be
no-op for embedded case - so some kind of system or static variable will
do the trick.
2) We can make Inflater cached at ZipFile instead of system level. When
ZipFile is closed, Inflater would be freed - until then it would get
reused for all streams from this ZipFile (as long as they are in single
thread of course). Problem here is with the systems which would like to
keep many ZipFile instances open, 'just in case' - they would pay
multiple cost with not benefit. Variation of it is having reference
counting for open ZipFiles - to have single static Inflater cache, but
empty it when last ZipFile is closed (and recreate if new one is
opened). But still you pay 36kb for having ZipFile open (in addition to
possibly a lot for ZipEntries).
Artur