help-bash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Counting words, fast!


From: Dennis Williamson
Subject: Re: Counting words, fast!
Date: Fri, 19 Mar 2021 13:30:16 -0500

On Fri, Mar 19, 2021, 12:31 PM Koichi Murase <myoga.murase@gmail.com> wrote:

> 2021年3月17日(水) 5:32 Jesse Hathaway <jesse@mbuki-mvuki.org>:
> >
> > Ben Hoyt wrote a fun blog post on counting words,
> > https://benhoyt.com/writings/count-words, I wrote a Bash version with
> > as many optimizations as I could find[1], I would love any suggestions
> > on any substantial optimizations, the code is attached. The full
> > repo and test file is here: https://github.com/benhoyt/countwords
> >
> > Yours kindly, Jesse
> >
> > [1]: https://github.com/benhoyt/countwords/pull/66
>
> Maybe, it is a bit late but some comments.
>
> * Bash indexed arrays are implemented by a sparse linked list (of
> key-value pairs), i.e., Bash indexed arrays are already a kind of
> sorted dictionary of (integer-)key & value. So, you don't have to
> construct the associative array and sort its keys by yourself, but you
> can just construct an indexed array and print it by "${array[@]}". A
> non-trivial thing is that the problem requires inputting the result in
> descending order with respect to the frequency. For this one needs to
> revert the array elements.
>
> * Usually shorter variable names are faster in Bash scripts. The
> speedup by this one is small for this script because this is not the
> bottleneck of this script, but there is indeed a difference.
>
> * Usually, `mapfile' is much faster than `read'. But somehow in this
> case, the implementation by `read -N 65536' seems to be faster than
> `mapfile' implementation. Because ${arr[*]} seems to have the
> quadratic time complexity O(N^2) under some conditions, there is a
> finite suitable size of a block to process at once which is something
> between 32768..65536 in this problem. `mapfile' cannot control the
> data size of the processed input through its options.
>
> * «while LANG=C IFS= read ...; do ...» is slow because Bash
> reinitializes the locales of the C library every time any shell
> variables LANG or LC_* is changed. One can instead set LANG outside
> the while statement and restore it after the while statement.
>
> I haven't thoroughly tested it, but here's my optimization (to the
> latest version using read -N 65536):
>
> --------------------
> #!/bin/bash
>
> # load
> declare -iA h
> set -f
> LANG=C E=
> until [[ $E ]]; do
>   IFS= read -N 65536 -r a || E=1
>   IFS= read -r b || E=1
>   for w in ${a@L}${b@L}; do
>     h[$w]+=1
>   done
> done
>
> # construct outputs
> for w in "${!h[@]}"; do
>   f=${h[$w]}
>   o[f]+=$w' '$f$'\n'
> done
>
> # reverse
> ((i=0,j=${#o[@]}-1))
> while ((j>=0)); do r[i++]=${o[j--]}; done
>
> printf '%s' "${r[@]}"
> --------------------
>
> --
> Koichi
>

Your optimizations trim another >30% off the best times of the other Bash
versions.

>


reply via email to

[Prev in Thread] Current Thread [Next in Thread]