bug-gawk

Re: [bug-gawk] Memory leak


From: arnold
Subject: Re: [bug-gawk] Memory leak
Date: Tue, 28 Mar 2017 08:25:19 -0600
User-agent: Heirloom mailx 12.4 7/29/08

Hi.

Since you're on Linux, you should have valgrind available. It would be best
if you ran valgrind on gawk in your own setup.

It really helps to compile gawk without optimization to get accurate
line number info:

        wget http://ftp.gnu.org/gnu/gawk/gawk-4.1.4.tar.gz
        tar -xpzvf gawk-4.1.4.tar.gz
        cd gawk-4.1.4
        ./configure
        # Edit Makefile and remove -O2 from the compilation flags
        make
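If you prefer not to edit the Makefile by hand, the same effect can usually
be had by passing the flags to configure directly (a sketch, assuming gawk's
standard Autoconf build; not from the original message):

```shell
# Standard Autoconf behavior: CFLAGS given on the configure command line
# override the default "-g -O2", so no manual Makefile edit is needed.
./configure CFLAGS="-g -O0"
make
```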

Using the gawk you compiled, something like:

        # from your pipeline
        zcat sdelse1.pip.gz.5t | tail -n +2 | 
                valgrind --leak-check=full gawk -F '|' -f test.awk 2> REPORT |
                        gzip > out.sdelse1.pip.gz.5t

Then send us the REPORT file. This should tell us whether there are real
memory leaks, and it will help keep Andy and me from thrashing around trying
to reproduce the problem.

Alternatively, you could make your exact data file available for us
to download (via off-list mail) and we'll destroy it when we're done.

Thanks,

Arnold

Stephane Delsert <address@hidden> wrote:

> Hi,
>
> To see the problem, you have to duplicate the sample file until you reach
> at least a couple of million (MM) records.
>
> I'm running the script on a virtual Linux server:
> Linux 2.6.32-642.el6.x86_64 #1 SMP Wed Apr 13 00:51:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
> from a Red Hat 6.8 distribution.
>
> I launched the process and stopped it at different points.
>
> The command:
> zcat sdelse1.pip.gz.5t | tail -n +2 | gawk -F '|' -f test.awk | gzip > out.sdelse1.pip.gz.5t
>  
> Here is what I have as the script progresses (top output; PID 14772, user
> sdelse, SHR 1016, state T throughout):
>
> Records    VIRT   RES   TIME+     %MEM
> 16 MM      213m   109m  1:34.33   0.3
> 24 MM      287m   182m  2:24.95   0.6
> 36 MM      548m   443m  3:52.30   1.4
> 54 MM      950m   845m  6:08.44   2.6
> 110 MM     2164m  2.0g  13:05.17  6.4
>
> The growth of the memory leak seems linear and appears to be linked to a
> memory allocation made during each call of the sort comparison function.
>
> Great thanks.
>
> Best regards,
>
> Stéphane.
>
>
>
>
> -----Original Message-----
> From: Andrew J. Schorr [mailto:address@hidden 
> Sent: lundi 27 mars 2017 20:15
> To: Stephane Delsert <address@hidden>
> Cc: address@hidden; Fatima Aliane <address@hidden>; Vihan_Sharma - Vihan 
> Sharma (LiveRamp) <address@hidden>
> Subject: Re: [bug-gawk] Memory leak
>
> Hi,
>
> I don't see any memory growth at all using version 4.1.4.
> Am I running this correctly?
>
> bash-4.2$ gawk --version | head -1
> GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.1, GNU MP 6.0.0)
>
> bash-4.2$ wc sample4gnu.pip 10x.pip 
>     345     345   91894 sample4gnu.pip
>    3450    3450  918940 10x.pip
>    3795    3795 1010834 total
>
> bash-4.2$ /bin/time gawk -f test.awk sample4gnu.pip
>  
> FIELD11|FIELD0|FIELD1|FIELD2|FIELD3|FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11
> 0.00user 0.00system 0:00.01elapsed 9%CPU (0avgtext+0avgdata 1748maxresident)k
> 0inputs+8outputs (0major+518minor)pagefaults 0swaps
>
> bash-4.2$ /bin/time gawk -f test.awk 10x.pip
>  
> FIELD11|FIELD0|FIELD1|FIELD2|FIELD3|FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11
> 0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 1748maxresident)k
> 0inputs+8outputs (0major+518minor)pagefaults 0swaps
>
> Regards,
> Andy
>
> On Mon, Mar 27, 2017 at 06:03:42PM +0000, Stephane Delsert wrote:
> > Hi,
> > 
> > I've attached a small sample and a little script in case you want to create
> > a bigger file. The script doesn't change the initial order. My user sort
> > function uses 2 internal tables, which could be an avenue to investigate; I
> > tried initializing those tables in the BEGIN block, but without success.
> > Normally I use gawk as a filter for simple processing. The number of lines
> > in input and in output is huge, but the processing remains simple. This
> > tool is already highly powerful, and I have processed several billions of
> > lines with high performance; nevertheless I will study all the
> > possibilities that this extension can offer.
> > 
> > Great thanks ,
> > Regards,
> > 
> > Stéphane.
> > 
> > 
> > 
> > -----Original Message-----
> > From: Andrew J. Schorr [mailto:address@hidden
> > Sent: lundi 27 mars 2017 17:20
> > To: Stephane Delsert <address@hidden>
> > Cc: address@hidden; Fatima Aliane <address@hidden>; 
> > Vihan_Sharma - Vihan Sharma (LiveRamp) <address@hidden>
> > Subject: Re: [bug-gawk] Memory leak
> > 
> > Hi,
> > 
> > Thanks for the bug report. Is it possible for you to supply a small sample
> > dataset that can be used with this script?
> > 
> > Also, gawk's array implementation currently incurs a lot of overhead for 
> > each array entry saved. I think the last time I measured this, it was 
> > around 253 bytes per array element when the index and the value were both 
> > strings. Since you are using numeric indices, the overhead should be less, 
> > but it still can consume a tremendous amount of memory. If you load 320 
> > million records, that might come to tens of GB of overhead. Are you certain 
> > that the PROCINFO["sorted_in"] setting really matters? I wonder if this is 
> > simply a problem with gawk array overhead.
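Andy's estimate above can be sanity-checked with quick arithmetic (a rough
sketch, not from the original message; the 253-byte figure is his approximate
measurement for string index/value pairs, and the real overhead with numeric
indices should be lower):

```python
# Back-of-the-envelope estimate of gawk array overhead, using the
# ~253 bytes/element figure quoted above (an approximation).
overhead_per_elem = 253            # bytes per array element (approximate)
records = 320_000_000              # records in the failing run
total_gib = overhead_per_elem * records / 2**30
print(f"~{total_gib:.0f} GiB of overhead")  # prints "~75 GiB of overhead"
```

This is in the same ballpark as the 20 GB Stéphane observed, which is why
plain array overhead is worth ruling out before assuming a leak.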
> > 
> > For working with massive datasets, you might consider trying the gawkextlib 
> > lmdb extension. It is very fast and handles large key-value stores. You can 
> > download it here:
> >    https://sourceforge.net/projects/gawkextlib/files/
> > 
> > Regards,
> > Andy
> > 
> > On Mon, Mar 27, 2017 at 02:42:28PM +0000, Stephane Delsert wrote:
> > > Hi,
> > > 
> > > We hit a memory leak in gawk with the attached script. The script sorts,
> > > on additional keys, a file already sorted on primary keys. To achieve
> > > this, I used a user-defined function and installed it as follows:
> > > PROCINFO["sorted_in"]="__sort_subsort"
> > > We noticed that the memory required by gawk grows as more records are
> > > processed. Gawk ended after over 320 MM of records; its memory size was
> > > over 20 GB. A post-mortem analysis showed that the maximum size of the
> > > script's tables was 121 elements.
> > > I ran various tests, and it appears that this issue doesn't occur when I
> > > don't use the PROCINFO mechanism. For small files, the script works
> > > correctly.
> > >
> > > I didn't see this kind of bug in the bug reports. I tried with version
> > > 4.1.3 and with version 4.1.4, without success.
> > > 
> > > Thank you for your help.
> > > 
> > > Best regards,
> > > 
> > > Stéphane Delsert.
> > > 
> > 
> > > BEGIN {
> > >     FS = "|"
> > >     OFS = "|"
> > >
> > >     sort_old_key_1 = ""
> > >     sort_old_key_2 = ""
> > >     sort_old_key_3 = ""
> > >     sort_old_key_4 = ""
> > >     sort_old_key_5 = ""
> > >     sort_old_key_6 = ""
> > >     sort_old_key_7 = ""
> > >     sort_old_key_8 = ""
> > >     sort_old_key_9 = ""
> > >     split("", tab_store)
> > >     split("", subsort_tab1)
> > >     split("", subsort_tab2)
> > >     nb_tab_store = 0
> > >     PROCINFO["sorted_in"] = "__sort_subsort"
> > > }
> > >
> > > {
> > >     FIELD0 = $1
> > >     FIELD1 = $2
> > >     FIELD2 = $3
> > >     FIELD3 = $4
> > >     FIELD4 = $5
> > >     FIELD5 = $6
> > >     FIELD6 = $7
> > >     FIELD7 = $8
> > >     FIELD8 = $9
> > >     FIELD9 = $10
> > >     FIELD10 = $11
> > >     FIELD11 = $12
> > >
> > >     sort_key_1 = " " FIELD2
> > >     sort_key_2 = " " FIELD3
> > >     sort_key_3 = " " FIELD4
> > >     sort_key_4 = " " FIELD5
> > >     sort_key_5 = " " FIELD6
> > >     sort_key_6 = " " FIELD7
> > >     sort_key_7 = " " FIELD8
> > >     sort_key_8 = " " FIELD1
> > >     sort_key_9 = " " FIELD9
> > >
> > >     sort_prim_compare = (sort_old_key_1 < sort_key_1) ? -1 : ((sort_old_key_1 == sort_key_1) ? 0 : 1)
> > >     sort_prim_compare = (sort_prim_compare == 0) ? ((sort_old_key_2 < sort_key_2) ? -1 : ((sort_old_key_2 == sort_key_2) ? 0 : 1)) : sort_prim_compare
> > >     sort_prim_compare = (sort_prim_compare == 0) ? ((sort_old_key_3 < sort_key_3) ? -1 : ((sort_old_key_3 == sort_key_3) ? 0 : 1)) : sort_prim_compare
> > >     sort_prim_compare = (sort_prim_compare == 0) ? ((sort_old_key_4 < sort_key_4) ? -1 : ((sort_old_key_4 == sort_key_4) ? 0 : 1)) : sort_prim_compare
> > >     sort_prim_compare = (sort_prim_compare == 0) ? ((sort_old_key_5 < sort_key_5) ? -1 : ((sort_old_key_5 == sort_key_5) ? 0 : 1)) : sort_prim_compare
> > >     sort_prim_compare = (sort_prim_compare == 0) ? ((sort_old_key_6 < sort_key_6) ? -1 : ((sort_old_key_6 == sort_key_6) ? 0 : 1)) : sort_prim_compare
> > >     sort_prim_compare = (sort_prim_compare == 0) ? ((sort_old_key_7 < sort_key_7) ? -1 : ((sort_old_key_7 == sort_key_7) ? 0 : 1)) : sort_prim_compare
> > >     sort_prim_compare = (sort_prim_compare == 0) ? ((sort_old_key_8 < sort_key_8) ? -1 : ((sort_old_key_8 == sort_key_8) ? 0 : 1)) : sort_prim_compare
> > >     sort_prim_compare = (sort_prim_compare == 0) ? ((sort_old_key_9 < sort_key_9) ? -1 : ((sort_old_key_9 == sort_key_9) ? 0 : 1)) : sort_prim_compare
> > >
> > >     if ((sort_prim_compare > 0) && (NR > 1)) {
> > >         print "file not correctly sorted at " NR " line " > ".sortcsv.sh_14831_S.acx_error_message.9d"
> > >         exit 9
> > >     }
> > >
> > >     sort_sec_key_1 = " " FIELD11
> > >     if ((sort_prim_compare != 0) || (NR == 1)) {
> > >         if (nb_tab_store > 1) {
> > >             for (sort_tmp_line in tab_store) {
> > >                 print tab_store[sort_tmp_line]
> > >             }
> > >         } else {
> > >             if (nb_tab_store > 0) {
> > >                 print tab_store[0]
> > >             }
> > >         }
> > >
> > >         sort_old_key_1 = sort_key_1
> > >         sort_old_key_2 = sort_key_2
> > >         sort_old_key_3 = sort_key_3
> > >         sort_old_key_4 = sort_key_4
> > >         sort_old_key_5 = sort_key_5
> > >         sort_old_key_6 = sort_key_6
> > >         sort_old_key_7 = sort_key_7
> > >         sort_old_key_8 = sort_key_8
> > >         sort_old_key_9 = sort_key_9
> > >         split("", tab_store)
> > >         nb_tab_store = 0
> > >     }
> > >     $1 = $1
> > >     tab_store[nb_tab_store] = sort_sec_key_1 OFS $0
> > >     nb_tab_store += 1
> > > }
> > >
> > > END {
> > >     for (sort_tmp_line in tab_store) {
> > >         print tab_store[sort_tmp_line]
> > >     }
> > > }
> > >
> > > function __sort_subsort(i1, v1, i2, v2)
> > > {
> > >     nb_subsort_tab1 = split(v1, subsort_tab1)
> > >     nb_subsort_tab2 = split(v2, subsort_tab2)
> > >
> > >     sort_sec_compare = (subsort_tab1[1] < subsort_tab2[1]) ? -1 : ((subsort_tab1[1] == subsort_tab2[1]) ? 0 : 1)
> > >
> > >     return sort_sec_compare
> > > }
>
> > Archive:  /var/tmp/samplegnu.zip
> > Zip file size: 9071 bytes, number of entries: 3
> > -rw-rw-r--  3.0 unx    91894 tx defN 17-Mar-27 13:30 sample4gnu.pip
> > -rw-rw-r--  3.0 unx       56 tx defN 17-Mar-27 13:36 README.txt
> > -rw-rw-r--  3.0 unx     3427 tx defN 17-Mar-27 10:15 test.awk
> > 3 files, 95377 bytes uncompressed, 8601 bytes compressed:  91.0%
>
>
> -- 
> Andrew Schorr                      e-mail: address@hidden
> Telemetry Investments, L.L.C.      phone:  917-305-1748
> 545 Fifth Ave, Suite 1108          fax:    212-425-5550
> New York, NY 10017-3630
>


