Re: [PATCH] md5: accepts a new --threads option

From: Pádraig Brady
Subject: Re: [PATCH] md5: accepts a new --threads option
Date: Tue, 20 Oct 2009 11:11:00 +0100
User-agent: Thunderbird (X11/20071008)

Pádraig Brady wrote:
> Giuseppe Scrivano wrote:
>> Hello,
>> inspired by the attempt to make `sort' multi-threaded, I added threads
>> support to md5sum and the sha* programs family.  It has effect only when
>> multiple files are specified.
>> Any comment?
> How does it compare to:
>
>   files_per_process=10
>   cpus=4
>   find files | xargs -n$files_per_process -P$cpus md5sum
>
> I would expect it to be a bit better, as files_per_process
> could be very large, thus having less overhead in starting
> processes. Though is the benefit worth the extra implementation
> complexity and a new, less general interface for users?

Expanding a bit on why I don't think this should be added...

You don't gain much by splitting the work per file, as
the UNIX toolkit is already well equipped to process
multiple files in parallel with:

  find files | xargs -n$files_per_process -P$processes md5sum

That is a more general solution and works for any command
or collection of commands (a script). More generally still,
the work could be split across multiple machines (in the case
where the processing cost exceeds the transmission cost), using
ssh or whatever:

  find files | dxargs¹ ...
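To make the xargs approach concrete, here is a self-contained sketch; the file names, counts, and the -n/-P values are arbitrary examples chosen for illustration, not taken from this thread:

```shell
# Generate a few sample files, then checksum them in parallel with xargs -P.
dir=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do printf 'data%s\n' "$i" > "$dir/f$i"; done
# -n2: two files per md5sum invocation; -P4: up to four processes at once.
count=$(find "$dir" -type f | xargs -n2 -P4 md5sum | wc -l)
echo "$count files hashed"
rm -r "$dir"
```

With more files per invocation (-n) the process-startup overhead shrinks, which is the main advantage over spawning one md5sum per file.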

Also one often wants to split the work per data source
rather than per CPU, and so would need a variant of the
above rather than a self-contained threaded solution. Consider
the case where you have files on separate disks (separate heads):
you wouldn't want multiple threads/processes fighting over
a disk head, so you would do something like:

  find /disk1 | xargs md5sum & find /disk2 | xargs md5sum

Note that if we're piping/redirecting the output of the above
then we must be careful to line-buffer the output from md5sum
so that it's not interspersed. Hmm, I wonder whether we
should line-buffer the output from *sum by default.
In the meantime one can check for correctly separated output by
varying the -o parameter in the following:

   ( find /etc | xargs ./stdbuf -oL md5sum &
     find /etc | xargs ./stdbuf -oL md5sum
   ) 2>/dev/null | sed -n '/[^ ]\{32\}/!p'
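Alternatively, the per-disk split can sidestep the interleaving problem entirely by giving each job its own output file and waiting for both before combining them. A minimal sketch, using temp directories as stand-ins for /disk1 and /disk2:

```shell
# Two independent hashing jobs, each writing to a private output file.
d1=$(mktemp -d); d2=$(mktemp -d)
echo a > "$d1/f"; echo b > "$d2/f"
find "$d1" -type f | xargs md5sum > "$d1.sums" &
find "$d2" -type f | xargs md5sum > "$d2.sums" &
wait                        # block until both background jobs finish
lines=$(cat "$d1.sums" "$d2.sums" | wc -l)
echo "$lines checksum lines collected"
rm -r "$d1" "$d2"; rm "$d1.sums" "$d2.sums"
```

Since each job writes its own file, no buffering discipline is needed; the concatenation only happens after `wait` returns.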

Now it's a different story if the data within a file
could be processed in parallel, i.e. if the digest
algorithms themselves could be parallelized.
The higher the processing cost compared to the I/O cost,
the bigger the benefit would be. Doing a very quick
check of these costs on my laptop...

$ timeout -sINT 10 dd bs=32K if=/dev/sda of=/dev/null
347570176 bytes (348 MB) copied, 10.004 s, 34.7 MB/s

$ timeout -sINT 10 dd bs=32K if=/dev/zero | ./md5sum
1816690688 bytes (1.8 GB) copied, 10.0002 s, 182 MB/s

$ timeout -sINT 10 dd bs=32K if=/dev/zero | ./cat >/dev/null
9205088256 bytes (9.2 GB) copied, 10.0514 s, 916 MB/s

$ timeout -sINT 10 dd bs=32K if=/dev/zero of=/dev/null
48931995648 bytes (49 GB) copied, 10.0314 s, 4.9 GB/s
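To make the intra-file idea concrete, here is a hypothetical sketch of a chunked "tree hash": hash fixed-size pieces of a file in parallel, then hash the ordered list of piece digests. Note this produces a different digest than plain md5sum, so it only illustrates where the parallelism would come from; it is not a compatible replacement:

```shell
# Hash 256 KiB chunks of a 1 MiB sample file in parallel,
# then hash the concatenated chunk digests.
f=$(mktemp)
head -c 1048576 /dev/zero > "$f"          # 1 MiB sample input
dir=$(mktemp -d)
split -b 262144 "$f" "$dir/chunk."        # four 256 KiB chunks
# sort -k2 orders by chunk name so the result is deterministic
# regardless of which parallel job finishes first.
tree=$(ls "$dir"/chunk.* | xargs -n1 -P4 md5sum |
       sort -k2 | awk '{print $1}' | md5sum | awk '{print $1}')
echo "$tree"
rm -r "$dir"; rm "$f"
```

The higher the per-byte processing cost relative to I/O, the more such chunk-level parallelism would pay off, per the measurements above.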

Note there is some low-hanging fruit in speeding up md5sum et al.
They seem to use stdio needlessly, thus introducing extra data copying.
Also there is an improved SHA-1 implementation floating around that's 25% more efficient.


¹ http://www.semicomplete.com/blog/geekery/distributed-xargs.html
