Re: improve performance of a script
From: Pádraig Brady
Subject: Re: improve performance of a script
Date: Wed, 26 Mar 2014 12:54:12 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
On 03/25/2014 02:12 PM, xeon Mailinglist wrote:
> For each file inside the directory $output, I do a cat to the file and
> generate a sha256 hash. This script takes 9 minutes to read 105 files, with
> the total data of 556MB and generate the digests. Is there a way to make this
> script faster? Maybe generate digests in parallel?
>
> for path in $output
> do
>   # sha256sum
>   digests[$count]=$( $HADOOP_HOME/bin/hdfs dfs -cat "$path" | sha256sum |
>     awk '{ print $1 }')
>   (( count ++ ))
> done
This is not a bash question, so please ask on a more appropriate user-oriented
rather than developer-oriented list in future.
Off the top of my head I'd do something like the following to get xargs to
parallelize:
# hash each file in parallel, up to $(nproc) at a time, keeping just the hash field
digests=( $(
  find "$output" -type f |
    xargs -I '{}' -n1 -P$(nproc) \
      sh -c "$HADOOP_HOME/bin/hdfs dfs -cat '{}' | sha256sum" |
    cut -f1 -d' '
) )
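One caveat: with -P the hashes can complete in any order, so the array above is not
guaranteed to line up with the order of the input paths. If the path-to-digest pairing
matters, an (untested) variant along the same lines can print each path next to its
digest instead:

find "$output" -type f |
  xargs -I '{}' -n1 -P$(nproc) \
    sh -c "printf '%s ' '{}'; $HADOOP_HOME/bin/hdfs dfs -cat '{}' | sha256sum | cut -f1 -d' '"

Each output line is then "<path> <digest>", which is easy to read back into an
associative array.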
You might want to distribute that load across systems too
with something like dxargs or perhaps something like hadoop :p
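If you do go distributed, a rough sketch with GNU parallel standing in for dxargs
(host1 and host2 are placeholders, and hdfs/$HADOOP_HOME are assumed to be set up
identically on each worker) would be:

find "$output" -type f |
  parallel -S host1,host2 \
    '$HADOOP_HOME/bin/hdfs dfs -cat {} | sha256sum | cut -f1 -d" "'

The single quotes leave $HADOOP_HOME to be expanded by the remote shell rather than
locally.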
thanks,
Pádraig.