


From: Jim Meyering
Subject: Re: feature request: gzip/bzip support for sort
Date: Thu, 18 Jan 2007 21:58:26 +0100

Paul Eggert <address@hidden> wrote:
> Jim Meyering <address@hidden> writes:
>> So, with just one trial each, I see a 19% speed-up.
> Yaayyy!  That's good news.  Thanks for timing it.  I read your email
> just after talking with Dan (in person) about how we'd time it.  I
> just bought 1 TB worth of disk for my home computer and hadn't hooked
> it up yet, so was going to volunteer that, but you beat me to it.

I've done some more timings, but with two more sizes of input.
Here's the summary, comparing straight sort with sort --comp=gzip:

  2.7GB:   6.6% speed-up
  10.0GB: 17.8% speed-up

For the smaller input, I also did as James Youngman suggested
and used "cat" as the no-op compressor/decompressor.
That made sort run 34% longer than the --compress=gzip run.
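Those percentages follow from the elapsed wall-clock times in the
/usr/bin/time output quoted below (converted to seconds by hand); a quick
cross-check can be sketched as:

```shell
# Sanity-check the quoted percentages from the elapsed times:
# 2.7GB: 15:53.49 plain vs 14:50.16 gzip; 10GB: 1:13:13 vs 1:00:10;
# cat-wrap: 19:50.86 vs the 14:50.16 gzip run.
awk 'BEGIN {
  printf "2.7GB:    %.1f%% speed-up\n", (953.49 - 890.16) / 953.49 * 100
  printf "10GB:     %.1f%% speed-up\n", (4393 - 3610) / 4393 * 100
  printf "cat-wrap: %.1f%% longer than gzip\n", (1190.86 - 890.16) / 890.16 * 100
}'
```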


Here's the smaller input:
  $ seq 9999999 > k
  $ cat k k k k k k k k k > j
  $ cat j j j j > sort-in
  $ wc -c sort-in
  2839999968 sort-in
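As a cross-check on that wc -c figure: seq 9999999 writes 78,888,888
bytes, and the two cat steps multiply that by 9 and then by 4. Assuming
seq and wc are available, this can be verified directly:

```shell
# Recompute the expected size of sort-in: seq output, times 9, times 4.
k=$(seq 9999999 | wc -c)
echo $(( k * 9 * 4 ))    # expect 2839999968, matching wc -c above
```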

With --compress=gzip:
  $ /usr/bin/time ./sort -T. --compress=gzip < sort-in > out
  814.07user 29.97system 14:50.16elapsed 94%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (4major+2821589minor)pagefaults 0swaps

With no --compress= option:
  $ /usr/bin/time ./sort -T. < sort-in > out
  398.98user 17.08system 15:53.49elapsed 43%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (2major+229797minor)pagefaults 0swaps

With --compress=$PWD/cat-wrap:
  [where the cat-wrap script accepts and ignores the -d option:
   printf '#!/bin/sh\ntest $# != 0 && test x$1 = x-d && shift; exec cat "$@"' \
     > cat-wrap
   chmod a+x cat-wrap

   BTW, this example already demonstrates how it'd be nice to be able to
   specify a decompressor separately: a wrapper like this is needed
   whenever the decompressor isn't simply "compressor -d".]
  $ /usr/bin/time ./sort -T. --compress=$PWD/cat-wrap < sort-in > out
  439.67user 54.02system 19:50.86elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (1major+2817586minor)pagefaults 0swaps
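For reference, the cat-wrap script can be exercised on its own to confirm
it passes data through unchanged in both directions (this little demo is
mine, not part of the timing runs):

```shell
# Recreate the wrapper, then check the compress and decompress paths.
printf '#!/bin/sh\ntest $# != 0 && test x$1 = x-d && shift; exec cat "$@"\n' \
  > cat-wrap
chmod a+x cat-wrap
echo hello | ./cat-wrap        # compressor direction: prints "hello"
echo hello | ./cat-wrap -d     # decompressor direction: -d is ignored, prints "hello"
```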

Using a 10GB data set (exactly 10737418240 bytes),
formed by concatenating four copies of the above and then truncating
to the desired length, ...
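One way that concatenate-and-truncate step might look (head -c is my
choice here, not necessarily what was actually used; shown on a tiny
stand-in file so it runs instantly):

```shell
# Stand-in demo of "concatenate four copies, then truncate".
# For the real data set, substitute sort-in and 10737418240 bytes.
printf 'abcdefgh' > demo            # 8-byte stand-in for sort-in
cat demo demo demo demo > demo-4x   # four copies -> 32 bytes
head -c 20 demo-4x > demo-trunc     # truncate to the desired length
wc -c < demo-trunc                  # prints 20
```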

  $ /usr/bin/time ./sort -T. --compress=gzip < sort-in > out; rm out
  3330.45user 139.57system 1:00:10elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (5major+10679797minor)pagefaults 0swaps

  $ /usr/bin/time ./sort -T. < sort-in > out; rm out
  1643.09user 86.83system 1:13:13elapsed 39%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (2major+233951minor)pagefaults 0swaps

The result: an 18% speed-up.
