bug-coreutils

bug#13243: [PATCH] enhancement: modify md5sum to allow piping


From: Daniel Santos
Subject: bug#13243: [PATCH] enhancement: modify md5sum to allow piping
Date: Thu, 20 Dec 2012 16:09:38 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.11) Gecko/20121128 Thunderbird/10.0.11

There are many times, usually when doing system backups, maintenance, recovery, etc., that I would like to pipe large files through md5sum to produce or verify a hash so that I do not have to read the file multiple times. This is especially the case when backing up a system from a livecd across the network:

dd if=/dev/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678
or
tar c /mnt/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678

Attached is a preliminary patch set that allows this, as in the following example:

dd if=/dev/sda3 | pbzip2 -c2 | md5sum -po /tmp/sda3.dat.bzip2.md5 | netcat 192.168.1.123 45678

-p is short for --pipe and -o <filename> is short for --outfile <filename>. Then, on the receiving end, the hash can be computed as the data is received, so that any network corruption can be detected:

netcat -l -p 45678| md5sum -po sda3.dat.bzip2.rx.md5 > sda3.dat.bzip2

The only caveat is that you have to compare the sum files manually, which you can do by calling diff. That is a small cost compared to re-reading a 200GiB file!

You can even get the sum prior to compression, although if you want to avoid a duplicate read on the server end, you have to decompress the data as you read it and either store the file uncompressed or re-compress it:

dd if=/dev/sda3 | md5sum -po /tmp/sda3.dat.md5 | pbzip2 -c2 | netcat 192.168.1.123 45678
with
netcat -l -p 45678| pbzip2 -cd | md5sum -po sda3.dat.rx.md5 > sda3.dat
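
To illustrate the pass-through idea behind the pipelines above, here is a minimal sketch. It is not taken from the attached patches. The names pipe_and_hash, fnv1a_update and CHUNK are hypothetical, and a 64-bit FNV-1a checksum stands in for MD5 only to keep the example short. The point is that each chunk is hashed and forwarded in the same pass, so the data is read exactly once.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical chunk size for this sketch (32k, matching the chunk
   size discussed elsewhere in this message).  */
#define CHUNK 32768

/* Stand-in digest: 64-bit FNV-1a.  This is NOT MD5; it only keeps the
   example self-contained.  */
static uint64_t
fnv1a_update (uint64_t h, const unsigned char *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
    {
      h ^= buf[i];
      h *= 0x100000001b3ULL;    /* FNV-1a 64-bit prime */
    }
  return h;
}

/* Copy IN to OUT unchanged while hashing every byte exactly once.
   On success store the digest in *DIGEST and return 0; return -1 on
   an I/O error.  */
static int
pipe_and_hash (FILE *in, FILE *out, uint64_t *digest)
{
  unsigned char buf[CHUNK];
  uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a 64-bit offset basis */
  size_t n;

  while ((n = fread (buf, 1, sizeof buf, in)) > 0)
    {
      h = fnv1a_update (h, buf, n);     /* hash the chunk ...  */
      if (fwrite (buf, 1, n, out) != n) /* ... then forward it as-is */
        return -1;
    }
  if (ferror (in) || fflush (out) != 0)
    return -1;

  *digest = h;
  return 0;
}
```

A caller would invoke pipe_and_hash (stdin, stdout, &d) and then write d to the file named by the output option, which is roughly the role -p and -o play in the examples above.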

The attached patchset is in a very early stage and has many problems:

 * GNU coding style compliance (this coding style is new to me)
 * API in gnulib is changed, may break other apps
 * all changes are lumped together and need to be broken apart into
   logical changes
 * it has a few hacks that need to be cleaned up

Also, this patch set addresses a problem with gnulib's hash functions, which contain a lot of copy & paste code. I've implemented a mechanism to clean this up w/o a performance hit (as long as we're using gcc 4.6.1+). This change should probably go into a separate patchset & bug report.

Finally, after the cursory amount of time that I've spent working with this code, I see a number of other areas where I believe there's room for improvement.

 * The copy & paste code problem (mentioned above)
 * Centralize the location where BLOCKSIZE is defined and only verify
   it's a multiple of 64 in gnulib/lib/{md,sha}*.c
 * Perhaps allow BLOCKSIZE to be defined at configure time? Honestly,
   I'm not familiar enough with the issues to be certain it would
   alter performance on any system, but I'm thinking about embedded
   systems, where reading 32k chunks may end up thrashing the cache
   but 8k or 4k would not. However, I don't think I would be in
   favor of this being a run-time parameter, as it would seem to be a
   lot of waste (and lost optimizations) for something that's probably
   pretty specific to the hardware and build target.
 * Centralize compiler sniffing into a single gnulib header (like
   "compiler.h" or some such) and define the GCC_VERSION macro as
   described in
   http://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html.
 * Make better use of __builtin_expect via portable likely/unlikely
   macros to make sure error handling code gets moved out of the main
   bodies of functions (which can save a cache miss here and there).
   Of course, doing this cleanly would require the above item.
 * Introduce some tuning parameter in the configure script to choose
   between smaller code and larger but more optimized code.  I bring
   this up mainly because in my re-work of the copy & pasted code, I
   see a large opportunity to create a much smaller executable (if
   needed), but one that would run slightly slower, which would
   usually be undesirable on a machine with plenty of RAM, storage
   and CPU cache.
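
As a rough sketch of how the BLOCKSIZE, compiler-sniffing and likely/unlikely items in the list above could fit together in one header (the header guard, the fallback definitions, the 3.0 version gate and the sum_bytes helper are assumptions made for illustration, and none of this is taken from gnulib or from the attached patches):

```c
/* Hypothetical "compiler.h"-style header, for illustration only.  */
#ifndef COMPILER_H
#define COMPILER_H

#include <stddef.h>

/* GCC_VERSION in the form shown in the GCC manual's "Common
   Predefined Macros" section; 0 when not building with a GNU C
   compatible compiler.  */
#ifdef __GNUC__
# define GCC_VERSION \
    (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)
#else
# define GCC_VERSION 0
#endif

/* Portable branch hints.  Assumption: gating on 3.0 is conservative
   enough for __builtin_expect; other compilers get a plain test.  */
#if GCC_VERSION >= 30000
# define likely(x)   __builtin_expect (!!(x), 1)
# define unlikely(x) __builtin_expect (!!(x), 0)
#else
# define likely(x)   (!!(x))
# define unlikely(x) (!!(x))
#endif

/* One central BLOCKSIZE that a build can override, for example with
   -DBLOCKSIZE=8192, and that is validated in exactly one place.  */
#ifndef BLOCKSIZE
# define BLOCKSIZE 32768
#endif
#if BLOCKSIZE % 64 != 0
# error "BLOCKSIZE must be a multiple of 64"
#endif

/* Hypothetical helper showing the intended use: the error path is
   marked unlikely so the compiler may move it out of the hot path.  */
static int
sum_bytes (const unsigned char *buf, size_t len, unsigned long *sum)
{
  if (unlikely (buf == NULL || sum == NULL))
    return -1;

  unsigned long s = 0;
  for (size_t i = 0; i < len; i++)
    s += buf[i];
  *sum = s;
  return 0;
}

#endif /* COMPILER_H */
```

Whether the hints or a smaller BLOCKSIZE actually help on a given target would have to be measured; the sketch only shows where such decisions could live.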

Obviously, these should be made into separate bug reports as well, and I can send separate emails for them if you like.

Daniel

Attachment: 0001-md5sum-pipe.patch
Description: Text Data

Attachment: 0001-piping-support.patch
Description: Text Data

