
patch: gzip compress 26% to 43% faster at 6<=level


From: John Reiser
Subject: patch: gzip compress 26% to 43% faster at 6<=level
Date: Tue, 02 Jul 2013 08:32:23 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7

Using my previous patch set
   http://lists.gnu.org/archive/html/bug-gzip/2013-06/msg00027.html
I tried compressing using direct indexing and separate chaining of
3-strings: HASH_BITS=24, H_SHIFT=8.  This was too slow because the
typical L2 cache of 256KiB or 512KiB suffered many misses, including
"sweeping".  However, separate chaining did greatly decrease the number
of positions examined by longest_match(), which is the leading term of
the overall run time for compressing.

Thus I changed to direct indexing by 2-strings (pairs) as a stem, and
for each stem value the set of 3rd characters is stored consecutively
as gathered leaves.  When processing a new window, for each stem the
old leaves are scattered by value into an array, the new 3-chains for
that stem are added, and the new leaves are gathered back into
consecutive storage.  This gives exact chaining of 3-strings but with
nice cache usage.  The memory cost is an additional 1/2 MiB or so, plus
the requirement for fast indexing of an array of 65,536 'short'
elements.  (Sorry, MAXSEG_64K.)
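
Schematically, the per-stem regrouping amounts to a counting sort by
leaf value.  The names, types, and scratch-buffer convention below are
illustrative only; the attached 0003-chain.patch is the authority:

   #include <stdint.h>
   #include <string.h>

   typedef uint16_t Pos;   /* gzip's window position type */

   /* Regroup the leaves of ONE stem (one 2-byte pair).  On entry,
      ch[i]/pos[i] hold the 3rd character and window position of each
      occurrence of the stem, old leaves first, newly found ones after.
      On exit, equal 3rd characters are consecutive, so an exact
      3-string chain is one contiguous run.  That contiguity is what
      keeps the cache usage nice.  tmp_ch/tmp_pos are scratch of at
      least n elements. */
   static void gather_stem_leaves(uint8_t *ch, Pos *pos, uint32_t n,
                                  uint8_t *tmp_ch, Pos *tmp_pos)
   {
       uint32_t count[256] = {0};
       uint32_t off[256];
       uint32_t i, sum = 0;

       for (i = 0; i < n; i++)          /* scatter: histogram by value */
           count[ch[i]]++;
       for (i = 0; i < 256; i++) {      /* each value's run start */
           off[i] = sum;
           sum += count[i];
       }
       for (i = 0; i < n; i++) {        /* gather back consecutively */
           uint32_t j = off[ch[i]]++;
           tmp_ch[j]  = ch[i];
           tmp_pos[j] = pos[i];
       }
       memcpy(ch,  tmp_ch,  n * sizeof *ch);
       memcpy(pos, tmp_pos, n * sizeof *pos);
   }

With leaves grouped this way, longest_match() for a given 3-string
touches only one contiguous run of positions instead of walking a hash
chain scattered across the window.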

That code is often 30% to 45% faster than the previous code for typical
"sweet spot" cases: approx. 0.37 < output/input < 0.55.  If the input
is already compressed then tracking the 3rd-character leaf values takes
time but gives little advantage: the 3-chains are all too short.  On
the other hand, for many highly compressible inputs even the chains of
3-strings are long.  Chains of 4-strings look desirable, but that is
left for future work.  For now, the scan of a new window computes a
histogram of the pairs, and reverts to chaining just the 2-strings
(pairs) if all chains are short.  Computing the histogram costs a few
percent, but prevents a bad slowdown when a window is already
compressed.
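
As a rough sketch of that pre-scan (the function name and the
chain-length cutoff are assumptions for illustration, not values from
the patch):

   #include <stdint.h>
   #include <stddef.h>
   #include <string.h>

   /* Count every 2-byte pair in the window; report whether any pair
      chain is long enough to justify tracking 3rd-character leaves.
      The cutoff of 8 is illustrative. */
   static int any_long_pair_chain(const uint8_t *window, size_t wsize)
   {
       static uint32_t hist[65536];     /* one counter per pair (stem) */
       size_t i;

       memset(hist, 0, sizeof hist);
       for (i = 0; i + 1 < wsize; i++)
           if (++hist[((uint32_t)window[i] << 8) | window[i + 1]] > 8)
               return 1;                /* a long chain exists */
       return 0;                        /* all short: 2-string chaining */
   }

Highly repetitive windows abandon the scan early; only windows whose
chains are uniformly short pay for the full pass, and those are exactly
the windows that fall back to 2-string chaining anyway.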

At compression level 6 or 7 the output is often around 1/2000 (0.05%)
larger than the previous result.  At level 8 or 9 the new result is
often very slightly smaller.  Compression level 8 (not 9) is often
pleasingly fast, with output that is essentially the same size as
before.  These results suggest additional tuning of the effective
internal parameters derived from configuration_table[].

Sliding the accounting data can become expensive.  If the compiler
generates a conditional branch for
   (Pos)((m > slide) ? (m - slide) : 0)
then that branch is often unpredictable and can take 15 or more cycles
for many elements.  An equivalent evaluation using a sign-extending
right shift:
   int d = m - slide;
   (Pos)(d & ~(d >> (2==sizeof(int)?15:4==sizeof(int)?31:63)))
takes a few cycles and avoids the conditional branch entirely.
Many current processors can perform "saturating subtract"
of 2, 4, 8, or 16 SIMD elements in one cycle, but the details
are machine dependent.  For example:
   https://github.com/kaffeemonster/zlib/commit/aa30d206d93755c1f5c287b78dd7d165fb0998a8
Unrolling and pipelining SIMD in assembler can achieve optimality:
each available cache cycle is used and is full-width productive.
The machine-dependent benefit is *several percent* faster compressing,
end-to-end.  Maintainers should step up to handle such a dependency.
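
To make both forms concrete, here is a hedged sketch (the function
names are mine; it assumes 32-bit int, an arithmetic right shift as the
sizeof() chain above does, and an element count that is a multiple
of 8, all easy to arrange for gzip's tables):

   #include <stdint.h>
   #include <stddef.h>

   typedef uint16_t Pos;

   /* Branchless slide: each element becomes max(m - slide, 0). */
   static void slide_scalar(Pos *table, size_t n, unsigned slide)
   {
       size_t i;
       for (i = 0; i < n; i++) {
           int d = (int)table[i] - (int)slide;
           table[i] = (Pos)(d & ~(d >> 31));   /* d >= 0 ? d : 0 */
       }
   }

   #ifdef __SSE2__
   #include <emmintrin.h>
   /* The same clamp as a saturating subtract, 8 elements at a time. */
   static void slide_sse2(Pos *table, size_t n, unsigned slide)
   {
       __m128i vs = _mm_set1_epi16((short)slide);
       size_t i;
       for (i = 0; i < n; i += 8) {
           __m128i v = _mm_loadu_si128((const __m128i *)(table + i));
           _mm_storeu_si128((__m128i *)(table + i),
                            _mm_subs_epu16(v, vs));
       }
   }
   #endif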

Exact 3-chains are expensive for low-effort compression: level <= 5
should use the previous hashing technique.  The code would choose at
run time among several versions of longest_match(): one for level <= 5,
one for when all chains in the window are short, one for the
anticipated "sweet spot" where 3-chains greatly shorten the search, and
one for when too many 3-chains are long.
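
A minimal sketch of that run-time selection (the thresholds and all
names here are placeholders, not values from the patch):

   /* Pick a longest_match() variant once per window, from statistics
      the pair histogram has already gathered.  Thresholds illustrative. */
   enum match_kind { MATCH_HASH, MATCH_SHORT, MATCH_SWEET, MATCH_LONG };

   static enum match_kind pick_match_kind(int level,
                                          unsigned max_chain_len,
                                          unsigned long_chain_count)
   {
       if (level <= 5)             return MATCH_HASH;   /* previous hashing */
       if (max_chain_len <= 4)     return MATCH_SHORT;  /* all chains short */
       if (long_chain_count > 64)  return MATCH_LONG;   /* many long 3-chains */
       return MATCH_SWEET;                              /* exact 3-chains win */
   }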

-- 



Attachment: 0003-chain.patch
Description: Text Data

