
bug#23113: parallel gzip processes trash hard disks, need larger buffers

From: Chevreux, Bastien
Subject: bug#23113: parallel gzip processes trash hard disks, need larger buffers
Date: Tue, 29 Mar 2016 23:03:44 +0000

> From: address@hidden [mailto:address@hidden On Behalf Of Jim Meyering
> [...]
> However, I suggest that you consider using xz in place of gzip.
> Not only can it compress better, it also works faster for comparable 
> compression ratios.

xz is not a viable alternative in this case: the use case is not archiving. There 
is a plethora of programs out there with zlib support compiled in, and those 
won't work on xz-compressed data. Furthermore, gzip -1 is approximately 4 times 
faster than xz -1 on FASTQ files (sequencing data), and the use case here is 
"temporary results, so ok-ish compression in a comparatively short amount of 
time". Gzip is ideal in that respect, as even at -1 it compresses down to 
~25-35% ... and that already helps a lot when you do not need 1 TiB of hard 
disk but only ~350 GiB. Gzip -1 takes ~4.5 hrs, xz -1 almost a day.

> That said, if you find that setting gzip.h's INBUFSIZ or OUTBUFSIZ to larger 
> values makes a significant difference, we'd like to hear about the results 
> and how you measured.

Changing INBUFSIZ did not have the hoped-for effect, as that is just the 
buffer size allocated by gzip ... in the end it uses only 64 KiB at most, and 
the calls to the file system's read() end up requesting only 32 KiB per call.

I traced this down through multiple layers to the function fill_window() in 
deflate.c, where things get really intricate, with multiple pre-set variables, 
defines and memcpy()s. It became clear that the code is geared towards using a 
64 KiB buffer with a rolling window of 32 KiB. Optimised for 16-bit machines, 
that is.
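For anyone who wants to see the pattern without digging through deflate.c: the refill logic works roughly like the sketch below. This is a simplified Python illustration of the technique, not the actual gzip code; the WSIZE value mirrors gzip's 32 KiB window define, and the CountingSource is only there to demonstrate the read() sizes.

```python
import io

WSIZE = 32 * 1024            # 32 KiB sliding window, as in deflate.c
WINDOW_SIZE = 2 * WSIZE      # the 64 KiB window[] array

def fill_window(window, write_pos, src):
    """Simplified refill in the style of deflate.c's fill_window().

    Once the write position enters the upper half, the upper 32 KiB is
    copied down (the history the matcher may still reference) and at
    most 32 KiB of fresh input is requested from read() -- which is why
    enlarging INBUFSIZ alone does not change the read() pattern.
    Returns the new write position and the number of bytes obtained.
    """
    if write_pos >= WSIZE:
        window[0:WSIZE] = window[WSIZE:WINDOW_SIZE]  # slide history down
        write_pos -= WSIZE
    chunk = src.read(min(WSIZE, WINDOW_SIZE - write_pos))
    window[write_pos:write_pos + len(chunk)] = chunk
    return write_pos + len(chunk), len(chunk)

class CountingSource:
    """Records the size of every read() request made against it."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.read_sizes = []
    def read(self, n):
        self.read_sizes.append(n)
        return self._buf.read(n)

src = CountingSource(b"x" * (200 * 1024))   # 200 KiB of mock input
window = bytearray(WINDOW_SIZE)
pos = 0
while True:
    pos, got = fill_window(window, pos, src)
    if got == 0:
        break

assert max(src.read_sizes) <= WSIZE         # no read ever exceeds 32 KiB
```

Driving the loop over 200 KiB of input shows exactly the behaviour reported above: every single read() request is capped at 32 KiB, regardless of how much buffer space exists.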

There are a few mentions of SMALL_MEM, MEDIUM_MEM and BIG_MEM variants via 
defines. However, code comments say that BIG_MEM would operate on a complete 
file loaded into memory ... which is a no-go for files in the range of 15 to 
30 GiB. I'm not even sure the code would do what the comments say.

Long story short: I do not feel expert enough to touch said functions and 
change them to provide larger input buffering. If I were forced to implement 
something, I'd try an outer buffering layer, but I'm not sure it would be 
elegant or even efficient.
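To make the outer-layer idea concrete: it would sit between the file and gzip's small reads, so the disk sees large sequential requests while the deflate code keeps its 32 KiB habits. In Python this is a one-liner via the buffering parameter of open(); the real thing would of course be C wrapped around read(), and the 16 MiB figure below is purely an assumed example value.

```python
import os
import tempfile

BIG = 16 * 1024 * 1024          # hypothetical: 16 MiB per physical read

# Stand-in for a large temporary data file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"abc" * 100_000)   # ~300 KB of mock data

# The outer buffering layer: a big OS-level read buffer. The consumer
# on top can keep issuing its small 32 KiB reads; most of them are
# served from memory instead of hitting the disk.
with open(path, "rb", buffering=BIG) as f:
    chunk = f.read(32 * 1024)   # consumer still asks for only 32 KiB
    assert len(chunk) == 32 * 1024

os.unlink(path)
```

Whether this actually helps against seek trashing from many parallel gzip processes would need measuring; it only changes where the small reads are absorbed, not how many of them gzip issues.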


PS: Then again, I'm toying with the idea of writing a simple gzip-packer 
replacement which simply buffers data and passes it to zlib.

DSM Nutritional Products Microbia Inc | Bioinformatics
60 Westview Street | Lexington, MA 02421 | United States
Phone +1 781 259 7613 | Fax +1 781 259 0615

