Re: Counting words, fast!

From: Dennis Williamson
Subject: Re: Counting words, fast!
Date: Wed, 17 Mar 2021 10:50:30 -0500

On Wed, Mar 17, 2021, 10:34 AM Jesse Hathaway wrote:

> On Tue, Mar 16, 2021 at 10:30 PM Dennis Williamson wrote:
> > I've been playing with your optimized code changing the read to grab
> > data in chunks like some of the other optimized code does - thus extending
> > your move from by-word to by-line reading to reading a specified larger
> > number of characters.
> >
> > IFS= read -r -N 4096 var
> >
> > And appending the result of a regular read to end at a newline. This
> > seemed to cut about 20% off the time. But I get different counts than your
> > code. I've tried using read without specifying a variable and using the
> > resulting $REPLY to preserve whitespace but the counts still didn't match.
> >
> > In any case this points to larger chunks being more efficient.
>
> Oh! That is a clever idea, I wanted to try reading in larger chunks, but
> I wasn't sure how to ensure I had read whole words until you gave
> this idea. Using 64K chunks I was able to shave off about 7s in my
> testing:
>
> declare -iA words_to_freq
> eof='false'
> set -o noglob
> while [[ "${eof}" == 'false' ]]; do
>   if ! LANG='C' IFS='' read -N 65536 -r block; then
>     eof='true'
>   fi
>   if ! IFS='' read -r line; then
>     eof='true'
>   fi
>   for word in ${block@L}${line@L}; do
>     words_to_freq["${word}"]+=1
>   done
> done
> set +o noglob

Did you try smaller blocks? I didn't see any difference above 4K. Did you
verify that the counts are correct? Your code is a little different than
mine and may fix the count issue I was having.
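
Both questions (does the block size matter, and are the counts right?) can be checked with a small harness. The following is only a sketch under stated assumptions, not the code either of us actually ran: the function names count_chunked, count_by_line and compare_counts are made up for illustration, and it lowercases with ${var,,} instead of ${var@L} so that it also runs on bash 4.x. It compares the chunk-plus-rest-of-line reader against a plain line-by-line baseline for each block size given.

```bash
#!/usr/bin/env bash
# Illustrative sketch; all function names here are hypothetical.

# Count words on stdin: read a block of $1 characters, then the rest of
# the current line so that no word is split across two blocks.
count_chunked() {
  local block_size="${1:-4096}" block line word eof='false'
  local -A -i freq=()
  set -o noglob                      # keep words like '*' from globbing
  while [[ "${eof}" == 'false' ]]; do
    block='' line=''
    if ! IFS='' read -r -N "${block_size}" block; then
      eof='true'                     # short (or empty) final block
    elif ! IFS='' read -r line; then
      eof='true'                     # final line had no trailing newline
    fi
    for word in ${block,,}${line,,}; do
      freq["${word}"]+=1
    done
  done
  set +o noglob
  for word in "${!freq[@]}"; do
    printf '%s %d\n' "${word}" "${freq[$word]}"
  done | LC_ALL=C sort
}

# Baseline: same counting, but one line at a time.
count_by_line() {
  local line word
  local -A -i freq=()
  set -o noglob
  while IFS='' read -r line || [[ -n "${line}" ]]; do
    for word in ${line,,}; do
      freq["${word}"]+=1
    done
  done
  set +o noglob
  for word in "${!freq[@]}"; do
    printf '%s %d\n' "${word}" "${freq[$word]}"
  done | LC_ALL=C sort
}

# Usage: compare_counts FILE BLOCK_SIZE...
compare_counts() {
  local file="$1" size baseline
  shift
  baseline="$(count_by_line < "${file}")"
  for size in "$@"; do
    if [[ "$(count_chunked "${size}" < "${file}")" == "${baseline}" ]]; then
      printf '%6d: counts match\n' "${size}"
    else
      printf '%6d: counts DIFFER\n' "${size}"
    fi
  done
}
```

With the functions sourced, something like `compare_counts words.txt 1024 4096 65536` (words.txt being a placeholder for the test input) checks correctness per block size, and `time count_chunked 4096 < words.txt > /dev/null` gives a timing for a single size.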