
From: Eli Zaretskii
Subject: Re: [EXTERNAL] Re: Performance issues using GAWK 3.1.6 ->from Win 2008 to Win 2016
Date: Wed, 16 Jun 2021 16:29:40 +0300

> From: Ed Morton <mortoneccc@comcast.net>
> Date: Wed, 16 Jun 2021 08:01:01 -0500
> Cc: "Pirane, Marco" <Marco.Pirane@pseg.com>,
>  "bug-gawk@gnu.org" <bug-gawk@gnu.org>, "Pereira,
>  Ricardo" <Ricardo_D.Pereira@pseg.com>
> 
> Some (obfuscated of course) sample input and expected output to test 
> with would be nice!

I tried to recreate the issue, based on the descriptions that were
posted here.  Bottom line: the original script indeed runs very slowly
(for reasons that were already identified), but I don't see any
significant difference between Windows 7 (the client counterpart of
Server 2008 R2) and Windows 10 (the counterpart of Server 2016).

Here are some details; feel free to ask more if someone is interested.

First, I created a 100,000-line manylines.csv file with random
contents, see the attached script manylines.awk.  I also created a
shorter 10,000-line somelines.csv file with 10 random fields for each
line; see the attached script somelines.awk.  I then wrote the script
simple2.awk, also attached below, which does stuff similar to the
original map_attr.awk script.  And finally, I timed the following
command:

  gawk -v f2=somelines.csv -f simple2.awk < manylines.csv > output.csv

This runs for very similar times on both a Windows 10 machine and a
Windows 7 machine: approximately 14 minutes.  Note that my CSV file
was only 100,000 lines, half of what the OP uses, so a somewhat
shorter run time is to be expected.  Also, my CSV file's fields are
numerical, so I think Gawk used numeric rather than string
comparison, which might also be faster.  But the running time is
nevertheless abysmally long, due to the inefficient algorithm: the
script consumes input at a rate of just 100 lines per second.  (For
comparison, it takes only about 3 sec to read a 2-million-line CSV
file on the same system, without any other processing.)

I also tried the "pipe from TYPE" method.  With my CSV files and
scripts, this made almost no difference: the same 14 minutes.  So the
pipe is not a significant factor here, which is to be expected given
the very slow rate at which the input is consumed.

Finally, I built a profiling version of Gawk and profiled this run.
Nothing stands out in the profile, AFAICT, but maybe others will see
something I didn't.  The top of the profile looks as follows:

  Flat profile:

  Each sample counts as 0.01 seconds.
    %   cumulative   self              self     total           
   time   seconds   seconds    calls   s/call   s/call  name    
   11.00     31.23    31.23                             awk_hash
   10.85     62.04    30.81        1    30.81    89.79  r_interpret
    9.15     88.02    25.98                             __strtodg
    5.25    102.92    14.90                             rs1scan
    4.82    116.60    13.68                             sc_parse_field
    4.20    128.51    11.91                             get_a_record
    4.09    140.12    11.61 2000100000     0.00     0.00  cmp_awknums
    3.80    150.92    10.80 1000100001     0.00     0.00  r_force_number
    3.63    161.21    10.29                             purge_record
    2.97    169.63     8.42 1005502010     0.00     0.00  get_field
    2.85    177.71     8.08 4000200003     0.00     0.00  cmp_nodes
    2.68    185.31     7.60 4006402047     0.00     0.00  r_unref
    2.59    192.65     7.34                             r_make_number
    2.57    199.94     7.29 1000100000     0.00     0.00  set_record
    2.49    207.00     7.06 1000100000     0.00     0.00  redirect_string
    2.32    213.59     6.59                             cmp_scalars
    2.14    219.65     6.06 1000100000     0.00     0.00  do_getline_redir
    2.06    225.49     5.84                             str_exists
    1.99    231.15     5.66                             __strtod
    1.89    236.51     5.36                             __lshift_D2A
    1.76    241.51     5.00                             __increment_D2A
    1.71    246.36     4.85 1005502009     0.00     0.00  r_get_field
    1.71    251.20     4.84                             __Balloc_D2A
    1.57    255.65     4.45 1000100000     0.00     0.00  redirect
    1.49    259.88     4.23 1711835360     0.00     0.00  mpfr_unset
    1.46    264.03     4.15                             set_field
    1.28    267.66     3.63                             __d2b_D2A
    1.26    271.24     3.58                             __Bfree_D2A
    1.24    274.75     3.51                             __hexnan_D2A

Thoughts?

Here are the files I promised to attach:

Attachment: manylines.awk
Description: Binary data

Attachment: somelines.awk
Description: Binary data

Attachment: simple2.awk
Description: Binary data

