From: Jari Aalto
Subject: [Bug-tar] Feature suggestion: reordering files by extension improves compression
Date: 13 Dec 2006 11:51:01 +0200

Please consider ordering the files by extension inside tar as this
produces better compression ratios according to Paul Sladen.


    Quick and dirty

    A simpler ordering method, clustering files purely by filename and
    extension, can be produced with a command similar to:

    cat filelist.txt | rev | sort | rev > neworder.txt

    This sorting process works by reversing each line in the file;
    hello.text would become txet.olleh, allowing files with similar
    file extensions or basenames to be ordered adjacently. The
    filenames are then reversed again, producing the new file order.
    This method appears to work well for language-packs containing
    translated strings, resulting in a 14% improvement when using bzip2
    compression both before and afterwards, or 2% if using gzip (most
    files are larger than the 32kB window size).

    I came across a paper (without source code) which discusses
    pre-ordering for efficient zdelta encoding as well as the tarfile
    ordering: Compressing File Collections with a TSP-Based Approach
    (PDF)[1]. For this paper, a relatively simple, greedy method is
    chosen, yielding compression improvements of ~10-15% on webpages
    of online news services.
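
To make the reverse-sort idea in the quoted text concrete, here is a
minimal Python sketch of the same ordering, plus a small helper for
comparing compressed sizes. It is an illustration only, not part of tar
or of the quoted article: the function names and sample paths are
hypothetical, and plain code-point ordering is assumed, which may differ
from what sort(1) produces under a non-C locale.

```python
import bz2
import io
import tarfile


def reorder_by_reversed_name(paths):
    """Sort paths by their reversed text, mimicking `rev | sort | rev`.

    Reversing puts the extension first, so files with the same
    extension (and then similar basenames) end up next to each other.
    """
    return sorted(paths, key=lambda p: p[::-1])


def bzip2_tar_size(files, order):
    """Return the bzip2-compressed size of an in-memory tar archive.

    `files` maps path -> bytes; `order` is the member order to write.
    Intended only for comparing two orderings of the same file set.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path in order:
            info = tarfile.TarInfo(name=path)
            info.size = len(files[path])
            tar.addfile(info, io.BytesIO(files[path]))
    return len(bz2.compress(buf.getvalue()))


# Hypothetical file list, standing in for filelist.txt.
names = ["po/de.po", "src/main.c", "po/fr.po", "src/util.c", "README.txt"]
new_order = reorder_by_reversed_name(names)
# new_order groups the .c files, then the .po files, then README.txt
```

Calling bzip2_tar_size(files, names) and bzip2_tar_size(files, new_order)
on a real file set would show whether the reordering helps for that
data; the gain depends heavily on the content, so no figure is implied
here. With GNU tar, the reordered list can be fed back in with
--files-from (-T).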

