Re: improve performance of a script

From: Greg Wooledge
Subject: Re: improve performance of a script
Date: Wed, 26 Mar 2014 09:26:11 -0400
User-agent: Mutt/

On Wed, Mar 26, 2014 at 12:54:12PM +0000, Pádraig Brady wrote:
> On 03/25/2014 02:12 PM, xeon Mailinglist wrote:
> > For each file inside the directory $output, I do a cat to the file and
> > generate a sha256 hash. This script takes 9 minutes to read 105 files, with
> > the total data of 556MB and generate the digests. Is there a way to make
> > this script faster? Maybe generate digests in parallel?

First, determine where the bottleneck is.  Is it CPU power to run the
hash command?  Or is it I/O (network? disk?) to read the files?
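A quick way to get a feel for that (a sketch; the temp file stands in for one of your inputs, and you'd swap in your actual hdfs -cat pipeline): compare wall-clock time against CPU time for a single file.

```shell
# Rough bottleneck check.  Compare "real" (wall-clock) time against
# "user" + "sys" (CPU) time in the output of `time`:
#   real >> user+sys  -> you are mostly waiting on I/O
#   user dominates    -> the hash computation itself is the cost
f=$(mktemp)
head -c 50M /dev/urandom > "$f"     # stand-in for one 5MB-ish input file
time sha256sum "$f" > /dev/null     # swap in: hdfs dfs -cat "$path" | sha256sum
rm -f "$f"
```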

> > for path in $output
> > do
> >     # sha256sum
> >     digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum | awk '{ print $1 }')
> >     (( count ++ ))
> > done

If $output is actually the name of a directory, then your syntax is
somewhat off.  It should be:

for file in "$output"/*; do
  digests[count++]=$( ... "$file" ... )
done

I wouldn't use "output" as the name of a variable that holds a directory,
though.  That's confusing.
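Put together, a minimal sketch of the corrected loop (hashing local files directly here; substitute your "hdfs dfs -cat ... | sha256sum" pipeline, and note the directory name is a placeholder):

```shell
#!/usr/bin/env bash
# Collect one sha256 digest per file in a directory.
dir=/path/to/inputs          # hypothetical; name it after what it holds
digests=()
for file in "$dir"/*; do
    # substitute: $HADOOP_HOME/bin/hdfs dfs -cat "$file" | sha256sum
    digests+=( "$(sha256sum < "$file" | awk '{ print $1 }')" )
done
printf '%s\n' "${digests[@]}"
```

Using digests+=( ... ) also makes the manual counter variable unnecessary.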

I don't know what a hadoop is, or an hdfs, or a dfs... in any case, you
do not appear to be "catting to" a file.  The files appear to be some kind
of input, not output (appended or overwritten).

If there are only 105 input files, and therefore 105 loop iterations,
then optimizing the bashy parts of the code to reduce forks isn't likely
to do very much.  Supposing you removed the awk, that would only save
you 105 forks, which is not likely to be noticeable (we're talking
milliseconds here) when the whole loop takes 9 minutes.
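For what it's worth, if you did want to shave that fork anyway, bash parameter expansion can strip the filename column without awk (a sketch; sha256sum prints "HASH  NAME", so cutting at the first space keeps the hash):

```shell
# Micro-optimization: replace the awk fork with parameter expansion.
f=$(mktemp)
printf 'hello\n' > "$f"
digest=$(sha256sum < "$f")
digest=${digest%% *}     # drop everything from the first space onward
echo "$digest"
rm -f "$f"
```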

> Off the top of my head I'd do something like the following to get xargs to
> parallelize:

Running multiple hdfs-whatevers in parallel may make the problem worse,
if that's where the bottleneck is.

Running multiple sha256sums in parallel would only help if the computer
has multiple CPU cores, and if the CPU happens to be the bottleneck here.
If it's a single-core machine, running multiple CPU-heavy processes in
parallel would just make it worse, because you'd introduce a whole bunch
of extra context switching.
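If it does turn out to be CPU-bound on a multi-core machine, the xargs approach Pádraig mentioned looks roughly like this (hashing local files here, and the input path is a placeholder; you'd wrap your hdfs -cat pipeline in the command xargs runs):

```shell
# Hash files in parallel, one sha256sum per job, up to one job per core.
find /path/to/inputs -maxdepth 1 -type f -print0 |
    xargs -0 -n1 -P"$(nproc)" sha256sum
```

Note the output order is not deterministic with -P, which is why it helps that sha256sum prints the filename next to each digest.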

Really, we don't have anywhere near enough information about the problem
to give a solution.  We can only give suggestions.

And this should be on help-bash, not bug-bash.  I've Cc'ed the former.
