Re: [bug-gawk] Memory leak


From: Stephane Delsert
Subject: Re: [bug-gawk] Memory leak
Date: Mon, 27 Mar 2017 18:03:42 +0000

Hi,

I've attached a small sample, plus a small script in case you want to create a
bigger file. That script doesn't change the initial order. My user-defined sort
function uses 2 internal tables, which could be worth investigating; I tried a
test that initializes those tables in the BEGIN block, but without success.
Normally I use gawk as a filter for simple processing. The number of input and
output lines is huge, but the processing remains simple. The tool is already
very powerful, and I have processed several billion lines with good
performance; nevertheless, I will study all the opportunities that this
extension can offer.

Many thanks,
Regards,

Stéphane.



-----Original Message-----
From: Andrew J. Schorr [mailto:address@hidden]
Sent: Monday, 27 March 2017 17:20
To: Stephane Delsert <address@hidden>
Cc: address@hidden; Fatima Aliane <address@hidden>; Vihan_Sharma - Vihan Sharma 
(LiveRamp) <address@hidden>
Subject: Re: [bug-gawk] Memory leak

Hi,

Thanks for the bug report. Is it possible for you to supply a small sample dataset 
that can be used with this script?

Also, gawk's array implementation currently incurs a lot of overhead for each 
array entry saved. I think the last time I measured this, it was around 253 
bytes per array element when the index and the value were both strings. Since 
you are using numeric indices, the overhead should be less, but it still can 
consume a tremendous amount of memory. If you load 320 million records, that 
might come to tens of GB of overhead. Are you certain that the 
PROCINFO["sorted_in"] setting really matters? I wonder if this is simply a 
problem with gawk array overhead.
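
For a rough sense of scale, the figures above can be multiplied out. This is
only a back-of-the-envelope sketch: the helper name estimate_overhead_gb is
made up for illustration, and 253 bytes per element is the string-indexed
figure, so numeric indices should come in lower.

```sh
# Rough upper-bound estimate of gawk array overhead in GB (10^9 bytes).
estimate_overhead_gb() {
    # $1 = number of array elements, $2 = assumed overhead in bytes per element
    awk -v n="$1" -v b="$2" 'BEGIN { printf "%.1f\n", n * b / 1e9 }'
}

estimate_overhead_gb 320000000 253   # prints 81.0
```

That only applies if all 320 million records are held in arrays at the same
time; a script that clears its arrays as it goes should stay far below it.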

For working with massive datasets, you might consider trying the gawkextlib 
lmdb extension. It is very fast and handles large key-value stores. You can 
download it here:
   https://sourceforge.net/projects/gawkextlib/files/

Regards,
Andy

On Mon, Mar 27, 2017 at 02:42:28PM +0000, Stephane Delsert wrote:
> Hi,
> 
> We hit a memory leak in gawk with the attached script. The script sorts a 
> file that is already sorted on its primary keys by additional keys. To 
> achieve this, I wrote a user-defined function and set it as follows:
> PROCINFO["sorted_in"]="__sort_subsort"
> We noticed that the memory required by gawk grows with the number of records 
> processed. Gawk terminated after more than 320 million records, with a memory 
> size of over 20 GB. A post-mortem analysis showed that the maximum size of 
> the script's tables was 121 elements.
> I ran several tests, and the issue does not occur when I don't use the 
> PROCINFO mechanism. For small files, the script works correctly.
> 
> I didn't find this kind of bug in the bug reports. I tested with versions 
> 4.1.3 and 4.1.4, without success.
> 
> Thank you for your help.
> 
> Best regards,
> 
> Stéphane Delsert.

>       BEGIN {
>               FS="|"
>               OFS="|"
>               
>                       sort_old_key_1=""
>                       sort_old_key_2=""
>                       sort_old_key_3=""
>                       sort_old_key_4=""
>                       sort_old_key_5=""
>                       sort_old_key_6=""
>                       sort_old_key_7=""
>                       sort_old_key_8=""
>                       sort_old_key_9=""       
>               split("", tab_store);
>               split("", subsort_tab1);
>               split("", subsort_tab2);
>               nb_tab_store=0;
>               PROCINFO["sorted_in"]="__sort_subsort"
>       }
>       {
>               FIELD0=$1
>               FIELD1=$2
>               FIELD2=$3
>               FIELD3=$4
>               FIELD4=$5
>               FIELD5=$6
>               FIELD6=$7
>               FIELD7=$8
>               FIELD8=$9
>               FIELD9=$10
>               FIELD10=$11
>               FIELD11=$12
>               
>               sort_key_1=" " FIELD2
>               sort_key_2=" " FIELD3
>               sort_key_3=" " FIELD4
>               sort_key_4=" " FIELD5
>               sort_key_5=" " FIELD6
>               sort_key_6=" " FIELD7
>               sort_key_7=" " FIELD8
>               sort_key_8=" " FIELD1
>               sort_key_9=" " FIELD9
>               sort_prim_compare = ( ( sort_old_key_1 < sort_key_1 ) ? -1 : ( ( sort_old_key_1 == sort_key_1 ) ? 0 : 1 ) );
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_2 < sort_key_2 ) ? -1 : ( ( ( sort_old_key_2 == sort_key_2 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_3 < sort_key_3 ) ? -1 : ( ( ( sort_old_key_3 == sort_key_3 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_4 < sort_key_4 ) ? -1 : ( ( ( sort_old_key_4 == sort_key_4 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_5 < sort_key_5 ) ? -1 : ( ( ( sort_old_key_5 == sort_key_5 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_6 < sort_key_6 ) ? -1 : ( ( ( sort_old_key_6 == sort_key_6 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_7 < sort_key_7 ) ? -1 : ( ( ( sort_old_key_7 == sort_key_7 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_8 < sort_key_8 ) ? -1 : ( ( ( sort_old_key_8 == sort_key_8 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_9 < sort_key_9 ) ? -1 : ( ( ( sort_old_key_9 == sort_key_9 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               
>               if ( ( sort_prim_compare > 0 ) && ( NR > 1 ) ) {
>                       print "file not correctly sorted at " NR " line " > ".sortcsv.sh_14831_S.acx_error_message.9d"
>                       exit 9
>               }
>               
>               sort_sec_key_1=" " FIELD11
>               if ( ( sort_prim_compare != 0 ) || ( NR == 1 ) ) {
>                       if ( nb_tab_store > 1 ) {
>                               for ( sort_tmp_line in tab_store )  {
>                                       print tab_store[sort_tmp_line] ; 
>                               }
>                       }
>                       else  {
>                               if ( nb_tab_store > 0 )  {
>                                       print tab_store[0] ; 
>                               }
>                       }
>                       
>                               sort_old_key_1= sort_key_1 
>                               sort_old_key_2= sort_key_2 
>                               sort_old_key_3= sort_key_3 
>                               sort_old_key_4= sort_key_4 
>                               sort_old_key_5= sort_key_5 
>                               sort_old_key_6= sort_key_6 
>                               sort_old_key_7= sort_key_7 
>                               sort_old_key_8= sort_key_8 
>                               sort_old_key_9= sort_key_9 
>                       split("", tab_store);
>                       nb_tab_store=0;
>               }
>               $1=$1
>               tab_store[nb_tab_store] = sort_sec_key_1  OFS $0
>               nb_tab_store += 1;
>       }
> 
> 
>       END {
>               for ( sort_tmp_line in tab_store  ) {
>                       print tab_store[sort_tmp_line] ; 
>               }
>       }
>       function __sort_subsort(i1,v1,i2,v2) 
>       {
>               nb_subsort_tab1 = split(v1, subsort_tab1 );
>               nb_subsort_tab2 = split(v2, subsort_tab2 );
> 
>               sort_sec_compare = ( ( subsort_tab1[1] < subsort_tab2[1] ) ? -1 : ( ( subsort_tab1[1] == subsort_tab2[1] ) ? 0 : 1 ) );
>               
>               return(sort_sec_compare)        
>       }
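
One way to check whether the two global tables used by __sort_subsort play any
part is to try a comparator that needs no arrays at all. The following is a
minimal sketch under stated assumptions, not a fix for the leak: the names
sort_by_first_field and cmp_first_field are hypothetical, the comparator takes
the text before the first "|" with index() and substr(), and its temporaries
are declared as extra parameters so they stay local. Note one behavioral
difference: substr() results always compare as strings, whereas split()
elements that look numeric compare as numbers.

```sh
# Requires gawk: PROCINFO["sorted_in"] is a gawk extension.
sort_by_first_field() {
    gawk '
    # Compare two stored lines by the text before their first "|".
    function cmp_first_field(i1, v1, i2, v2,    p1, p2, k1, k2)
    {
        p1 = index(v1, "|"); k1 = (p1 ? substr(v1, 1, p1 - 1) : v1)
        p2 = index(v2, "|"); k2 = (p2 ? substr(v2, 1, p2 - 1) : v2)
        return (k1 < k2) ? -1 : ((k1 == k2) ? 0 : 1)
    }
    { store[n++] = $0 }     # buffer every input line
    END {
        PROCINFO["sorted_in"] = "cmp_first_field"
        for (i in store)    # traversal order is decided by the comparator
            print store[i]
    }'
}

printf '%s\n' ' b|k|2' ' a|k|1' ' c|k|3' | sort_by_first_field
```

If memory still grows with this comparator in place of __sort_subsort, the
global tables are probably not the cause.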

Attachment: samplegnu.zip
Description: samplegnu.zip

