help-gawk

Re: How to Generate a Long String of the Same Character


From: Bob Proulx
Subject: Re: How to Generate a Long String of the Same Character
Date: Sun, 18 Jul 2021 22:59:53 -0600

Neil R. Ormos wrote:
> In a message on the bug-gawk list, Ed Mortin wrote:
> That should have been "Ed Morton".
> > On an online forum someone asked how to generate a
> > string of 100,000,000 "x"s. They had tried this in
> > a BEGIN section:
> > 
> >    for(i=1;i<=100000000;i++) s = s "x"
>...
> Building a big string by iterating in tiny chunks
> would seem to invite poor performance.

Agreed.  Growing by one character at a time definitely seems
inefficient.

> Instead, why not append the string to itself,
> doubling its size with each iteration?  For
> example:
> 
> time ~/.local/bin/gawk-5.1.0 \
>   'BEGIN{sizelim=100000000; a="x"; while (length(a) < sizelim) {a=a a};
>    a=substr(a, 1, sizelim); print length(a);}'

I think that is probably one of the best ways with awk.
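
To put a number on the difference: the one-character loop performs
100 million concatenations, while doubling needs only a few dozen.  A
small sketch that counts the doublings (count_doublings is a
hypothetical helper name, not something from the thread):

```shell
# Count how many doublings take a length-1 string to at least "limit"
# characters, and report the final (possibly overshot) length.
count_doublings() {
    awk -v limit="$1" 'BEGIN {
        len = 1; steps = 0
        while (len < limit) { len *= 2; steps++ }
        print steps, len
    }'
}

count_doublings 100000000   # prints: 27 134217728
```

So 27 concatenations build a string of 134217728 characters, which
substr then trims to the requested 100000000.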

My first thought was that it would be better to produce a file that
contained 100 million "x"s and then read it into awk.

    awk '{print length($0)}' < bigfileofx

Of course that simply changes the problem around to creating that
file!  This is rather a silly response but it's fun just the same.

Well...  There are certainly many ways to do it.  I would use dd to
create a byte stream of the right size.  There seems to be no way to
make dd itself produce "x" characters, but it can read /dev/zero, and
tr can translate the zero bytes to other characters such as an "x".

    $ dd status=none if=/dev/zero bs=1 count=10 | tr "\0" "x"; echo
    xxxxxxxxxx

    $ dd status=none if=/dev/zero bs=1 count=10 | tr "\0" "x" | wc -c
    10

That looks promising.  Let's fire it up for the requested 100 million
size.

    $ time dd status=none if=/dev/zero bs=1M count=100 | tr "\0" "x" | wc -c
    104857600

    real    0m0.179s
    user    0m0.126s
    sys     0m0.167s

Close to the right size: that is 100 MiB (104857600 bytes) rather than
exactly 100 million.  Let's get it into awk.

    $ time dd status=none if=/dev/zero bs=1M count=100 | tr "\0" "x" | awk '{print length($0)}'
    104857600

    real    0m0.624s
    user    0m0.451s
    sys     0m0.398s
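
Note that bs=1M count=100 yields 104857600 bytes, a little more than
the 100000000 originally asked for.  A minimal sketch for hitting the
exact count, assuming a head that accepts -c with a byte count (the GNU
and BSD versions do); gen_x is a hypothetical helper name:

```shell
# Emit exactly N "x" characters on stdout, with no trailing newline.
# Assumes head supports -c N and that /dev/zero exists.
gen_x() {
    head -c "$1" /dev/zero | tr '\0' 'x'
}

# Feed the stream to awk as a single record and report its length.
gen_x 100000000 | awk '{ print length($0) }'
```

This keeps the same shape as the dd pipeline above and only swaps the
byte source, so the two can be timed against each other directly.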

That's looking pretty good.  Let's compare it against the reference
above so one can see how slow my machine is about such things.

    $ time awk 'BEGIN{sizelim=100000000; a="x"; while (length(a) < sizelim) {a=a a}; a=substr(a, 1, sizelim); print length(a);}'
    100000000

    real    0m1.469s
    user    0m0.815s
    sys     0m0.654s

I am running this on an older Intel Core i5 CPU 750 2.67GHz.

> On my not-very-fast machine, according to the time
> built-in, that takes 0.17 seconds of elapsed time.

Faster than my daily driving desktop!  :-)

> Yes, worst-case, if the intended string has length
> (2^N)+1, you wastefully build a string of size
> 2^(N+1) and trim off almost half.  So maybe on
> some machines, building the string in
> single-character units would work but the doubling
> would not.

Fun stuff!  And it illustrates the usefulness of benchmarking to
collect data.
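
For the overshoot case mentioned above, one possible variation is to
build the string from the binary digits of the target length, so that
no intermediate string is ever longer than the target.  This is only a
sketch, not something benchmarked in this thread; repeat_char and rep
are hypothetical names:

```shell
# Print N copies of the string C, followed by a newline.
# The awk function rep() doubles a working piece and appends it to the
# result whenever the matching binary digit of the count is set.
repeat_char() {
    awk -v c="$1" -v n="$2" '
    function rep(s, k,    out) {
        out = ""
        while (k > 0) {
            if (k % 2) out = out s   # this binary digit of the count is 1
            k = int(k / 2)
            if (k > 0) s = s s       # double only if more digits remain
        }
        return out
    }
    BEGIN { print rep(c, n) }'
}

repeat_char x 10   # prints xxxxxxxxxx
```

The number of concatenations stays logarithmic in the length, as with
plain doubling, and no substr trim is needed at the end.  Whether it
is actually faster or lighter on memory is a question for the
benchmark.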

Bob


