
Compression tool based on Rabin chunks - integrate into gzip?


From: Bobby Martin
Subject: Compression tool based on Rabin chunks - integrate into gzip?
Date: Wed, 24 Feb 2010 16:35:23 -0600

*New lossless compression tool*
I have written a lossless compression tool based on Rabin chunks that,
when used in conjunction with gzip, produces files that are between 8%
and 33% smaller.  The median amount saved in my sample files is about
11%.

*Integration into gzip?*
I would love to integrate this into gzip after an appropriate period
of testing.  This is an exploratory email to see whether that is
possible, and to introduce the tool to interested parties if it's not.

I'm not sure what the mandate is for gzip: what restrictions apply to
the algorithm(s) used, running time, memory usage, and temporary
files?

If gzip is mandated to use only the DEFLATE (LZ77-based) algorithm it
currently uses, or if gzip is mandated to always remain backwards
compatible, so that any file compressed with gzip must be expandable
by previous versions of gzip, then obviously rabin cannot be
integrated into gzip.  In that case you can treat the rest of this
email as an announcement of a secondary compression tool that works
well with gzip.

Otherwise, I'd like to discuss the possibility of integrating the
tool.  I don't have strict memory or time performance statistics - all
I can do now is give my casual observations on some large files.  The
response to this email will determine how much effort I spend in the
immediate future gathering statistics.


*Accessing the tool*
I have started a GitHub project for rabin-tool at
http://github.com/wurp/rabin-tool/.  You can download it, build it
with make, and play with it from there.  See the README.  You can run
rabin with no arguments to see all the possible arguments.


*Some performance and efficacy tests*
As an example, I took a tar of the first 100 MB of /usr/bin on my
Debian box.  Gzipped, it's about 31 MB.  If I run it through rabin
first, it gzips down to about 27 MB - slightly better than 10%
smaller than without using my tool.

I also built a 4 MB tar file of the Java source for another project;
the same treatment (rabin, then gzip) resulted in a file 33% smaller
than gzip alone.  I built an 8 MB tar file of a WordPress HTML tree;
for this the final result was only 8% smaller than gzip alone.

I have found no case so far in which applying rabin resulted in a
larger file than the original, although I'm sure that would be the
case for multiple applications of rabin.

The run time for the tool is about twice that of gzip.  I have done
nothing to overlap its processing with the disk read or to
preallocate memory, so I'm sure that time can be reduced.  Obviously,
time spent reading the data from the drive would not need to be
repeated if the tool were integrated into gzip.  My rough guess is
that integrating a performance-optimized version of the tool into
gzip would add about 20% to the runtime.

The memory usage of the tool is currently not bounded.  8 bytes are
used for every rabin chunk encountered, so memory usage will increase
linearly with the file size.  Limiting the memory requirement would
not be difficult, and my feeling is that a reasonable limit,
intelligently implemented, would not significantly impact the
efficacy of the tool.  Memory usage currently should be about 10% of
file size for reasonable compression parameters.
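
For illustration only, here is a minimal sketch of the kind of
per-chunk index that costs on the order of 8 bytes per chunk.  The
struct name and field layout are my assumptions, not the tool's
actual code:

#include <stdint.h>

/* Hypothetical per-chunk index entry, roughly 8 bytes per chunk seen.
 * The table maps a chunk fingerprint to the offset of its first
 * occurrence, so later duplicates can be emitted as references.
 * One entry per chunk is why memory grows linearly with file size. */
struct chunk_entry {
    uint32_t fingerprint;   /* truncated fingerprint of the chunk      */
    uint32_t first_offset;  /* byte offset of the first occurrence     */
};

Taking the numbers above at face value, 8 bytes per chunk amounting to
about 10% of the file size implies an average chunk on the order of
80 bytes; bounding memory would mean capping the number of entries
and evicting, or simply not recording, chunks past the cap.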


*Tasks for which the tool is useful*
The main things that it’s good for are:
 * compression
 * identifying the parts of a file (text or binary) that have changed
 * minimizing the data needed to store multiple versions of a file
 * lots of other things.  For example:
    * you can identify which binaries are infected with a virus by
knowing the hashes of some of the chunks in the middle of the virus and
checking whether the binary also contains those chunks (see the sketch
after this list)
    * you can automatically detect viruses (even if you have never
seen the virus before) by scanning your outgoing traffic for the same
chunks going out to many different machines
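
As a rough illustration of the chunk-matching idea in the first
sub-bullet (all names here are hypothetical, and a real scanner would
use a hash set rather than nested loops), checking a binary against
known virus chunks is just a membership test over the fingerprints
the file decomposes into:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: does the file contain any chunk whose
 * fingerprint appears in a known signature set?  file_fps[] would come
 * from running the chunker over the binary; sig_fps[] are fingerprints
 * of chunks taken from the middle of a known virus. */
static bool contains_signature_chunk(const uint64_t *file_fps, size_t nfile,
                                     const uint64_t *sig_fps, size_t nsig)
{
    for (size_t i = 0; i < nfile; i++)
        for (size_t j = 0; j < nsig; j++)
            if (file_fps[i] == sig_fps[j])
                return true;
    return false;
}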

rabin can do simple inline compression of a file.  Longish sequences
of data that are repeated in the file are replaced by references to
the first time that data occurs, no matter how far away the repetition
is from the first occurrence.
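
To make the mechanism concrete, here is a minimal, self-contained
sketch of content-defined chunking with a rolling hash, in the spirit
of Rabin fingerprinting.  This is not the tool's actual code; the
window size, mask width, and multiplier are made-up illustrative
values.  The point is that chunk boundaries depend on the content in a
small sliding window, not on absolute position, so the same repeated
data produces the same chunks wherever it appears in the file:

#include <stdint.h>
#include <stdio.h>

#define WINDOW   48                   /* bytes in the rolling window   */
#define MASKBITS 7                    /* average chunk ~2^7 = 128 bytes */
#define MASK     ((1u << MASKBITS) - 1)
#define PRIME    1000000007ULL        /* rolling-hash multiplier       */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    /* PRIME^(WINDOW-1) mod 2^64, used to drop the outgoing byte. */
    uint64_t pw = 1;
    for (int i = 0; i < WINDOW - 1; i++)
        pw *= PRIME;

    unsigned char win[WINDOW] = {0};
    uint64_t hash = 0, pos = 0, chunk_start = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        unsigned char in  = (unsigned char)c;
        unsigned char out = win[pos % WINDOW];
        win[pos % WINDOW] = in;
        hash = (hash - out * pw) * PRIME + in;   /* slide the window */
        pos++;
        /* Declare a boundary when the low bits of the hash are zero
         * and the chunk has reached a minimum length. */
        if ((hash & MASK) == 0 && pos - chunk_start >= WINDOW) {
            printf("chunk at %llu, length %llu\n",
                   (unsigned long long)chunk_start,
                   (unsigned long long)(pos - chunk_start));
            chunk_start = pos;
        }
    }
    if (pos > chunk_start)
        printf("chunk at %llu, length %llu\n",
               (unsigned long long)chunk_start,
               (unsigned long long)(pos - chunk_start));
    fclose(f);
    return 0;
}

Compression then follows from keeping a map from chunk fingerprint to
first offset: when a chunk's fingerprint (and, to be safe, its bytes)
matches one seen before, the output records a reference to the
earlier occurrence instead of the chunk data.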

It can decompress what it compresses.


The tool provides the following support to enable tasks other than
inline compression:

It can split a file out into chunks and throw all the chunks into a
directory as separate files.  Chunks are identified by their Rabin
hash (currently; I may switch to another hashing algorithm).  If a
chunk occurs multiple times in the file, it only appears once in the
directory.
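
A sketch of the chunk-to-directory step (the naming scheme, the
.chunk suffix, and the helper name are my assumptions; the tool's
real on-disk format may differ):

#include <stdint.h>
#include <stdio.h>

/* Write one chunk into a directory, named by its hex fingerprint, so a
 * chunk that occurs many times in the input is stored only once. */
static int store_chunk(const char *dir, uint64_t fingerprint,
                       const unsigned char *data, size_t len)
{
    char path[4096];
    snprintf(path, sizeof path, "%s/%016llx.chunk",
             dir, (unsigned long long)fingerprint);

    FILE *existing = fopen(path, "rb");
    if (existing) {             /* chunk already stored, nothing to do */
        fclose(existing);
        return 0;
    }

    FILE *out = fopen(path, "wb");
    if (!out)
        return -1;
    size_t written = fwrite(data, 1, len, out);
    fclose(out);
    return written == len ? 0 : -1;
}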

It can list all of the chunks, chunk lengths, and chunk ids in the
file.  It can do this in combination with creating the chunk files.

I have a script that can reassemble the original file from the chunk
files plus the chunk list; I will eventually put this functionality
directly into the tool.
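
For reference, reassembly amounts to concatenating the chunk files in
the order the chunk list gives them.  A minimal sketch, assuming a
manifest with one hex chunk id per line (which may not be the tool's
exact list format):

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s CHUNK_DIR MANIFEST\n", argv[0]);
        return 1;
    }
    FILE *manifest = fopen(argv[2], "r");
    if (!manifest) { perror("manifest"); return 1; }

    char id[128], path[4096];
    while (fscanf(manifest, "%127s", id) == 1) {
        /* Look up each listed chunk in the chunk directory and append
         * its contents to stdout, rebuilding the original stream. */
        snprintf(path, sizeof path, "%s/%s.chunk", argv[1], id);
        FILE *chunk = fopen(path, "rb");
        if (!chunk) { perror(path); return 1; }
        char buf[8192];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, chunk)) > 0)
            fwrite(buf, 1, n, stdout);
        fclose(chunk);
    }
    fclose(manifest);
    return 0;
}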



Again, I would be happy to do the work to get this into gzip if that's
something that could reasonably be done.  Otherwise, I welcome
suggestions (directly to me to avoid spamming the gzip list) of places
I should announce the tool to make sure interested users hear about it
without spamming too many people who don't care.

Thanks!
Bobby



