bug-gawk
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely la


From: Jason C. Kwan
Subject: Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs
Date: Thu, 27 Jan 2022 08:13:58 -0500

Have you tried even just straight up pasting it and run it as is ? when 
interpreted scripting for the same platform is generating some 65% reduction in 
execution time versus the built in compiled binary from C code source, don’t 
you think it should at least warrant a quick look to see if there are potential 
I/O bottlenecks between gawk and the GMP add-on module it uses ?

i already used LC_ALL=C locale, per the bug reporting page, as you could see 
below. I even let gawk -M go second so if there were any gains from caching, it 
would be gawk -M that benefited from it. Maybe it’s an ARM thing, being that 
I’m on m1max instead of x64

I only sent here first because I’m somewhat certain the GMP side would 
instantly send me back here, seeing that I couldn’t isolate out whether the 
potential bottlenecks exists on gawk side or gmp side, and I haven’t written in 
C since 2002. I even tried reading the gawk 5.1.1 source code to see if I could 
identify anything that could be help, but haven’t been lucky.

Do u want me to read through the gmp source first ?

Regards,
Jason K

> arnold@skeeve.com於2022年1月27日 01:42寫道:
> 
> Hello.
> 
> Thank you for taking the time to report an issue.
> 
> Please read the instructions for bug reporting at
> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
> It was updated recently, please reread it if you haven't looked at
> it in a long time.
> 
> I'm afraid that your "proof of concept" code is so unreadable that
> I won't even try to determine what it's doing.
> 
> With respect to GMP performance, I suggest that you report an
> issue directly to the GMP developers.  The gawk code that uses
> GMP isn't going to change.  For this reason, I don't think you
> need to bother redoing your example code, unless you want to send
> it to the GMP developers.
> 
> Thanks,
> 
> Arnold
> 
> "Jason C. Kwan" via "Bug reports only for gawk." <bug-gawk@gnu.org> wrote:
> 
>> Hi GAWK team,
>> Not sure if I should be reporting this to the gawk team or the GnuMP team - 
>> it's neither a bug or nor new feature per se, but merely a performance one - 
>> I've noticed on that for extremely large inputs, say, integers with more 
>> than 5 million digits, the bignum gawk -M mode could be somewhat slow, with 
>> possible room for speed improvement (proof-of-concept attached below). my 
>> gawk version information : 
>> GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
>> Darwin MacBook-Pro.local 21.2.0 Darwin Kernel Version 21.2.0: Sun Nov 28 
>> 20:28:41 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T6000 arm64
>> On my test-case of 71.5 million digits, it's possible to to save 63.7% of 
>> time , even in regular gawk, when compared to gawk -M  : 
>> ==================
>> gawk -be '{ print length($0) }' test.txt71591842
>> test command ::
>> echo; time ( pv -q < test.txt | LC_ALL=C gawkmx -b -e '{ print _2x_($0) }' ) 
>> | xxh128sum ; echo; time (pv -q < test.txt | LC_ALL=C gawk -M -b -e '{ print 
>> $0+$0 }' ) | xxh128sum ; echo
>> a56c2d2302d9ea8751f810b848e6f354  stdin( pv -q < test.txt | LC_ALL=C 
>> LC_ALL=C gawk -e "${mfx}" -b) 8.16s user 0.60s system 99% cpu 8.789 
>> totalxxh128sum  0.01s user 0.00s system 0% cpu 8.789 total
>> a56c2d2302d9ea8751f810b848e6f354  stdin( pv -q < test.txt | LC_ALL=C gawk -M 
>> -b -e ; )  23.26s user 0.96s system 99% cpu 24.240 totalxxh128sum  0.01s 
>> user 0.00s system 0% cpu 24.239 total
>> =====================
>> In another test case of slightly over 275 milion digits, the time savings 
>> are 71.7% : 
>> fc2231bdff375b7870586d8dffc0841c  stdin( pv -q < jwengowengonoewgnwoegn.txt 
>> | LC_ALL=C gawk -b -e "${mfx}" -e ; )  27.25s user 6.48s system 98% cpu 
>> 34.350 totalxxh128sum  0.04s user 0.02s system 0% cpu 34.349 total
>> fc2231bdff375b7870586d8dffc0841c  stdin( pv -q < jwengowengonoewgnwoegn.txt 
>> | LC_ALL=C gawk -M -b -e ; )  116.58s user 4.78s system 99% cpu 2:01.42 
>> totalxxh128sum  0.04s user 0.02s system 0% cpu 2:01.42 total
>> ====================
>> Attached below is the full proof-of-concept code for function _2x_( ) to 
>> demostrate that the time savings are very much concrete and possible, not 
>> simply theoretical. The test file, being 70MB+, is a big large for email, 
>> but basically any file using ASCII digits 0-9 to represent any integer over 
>> 5 million digits will do. The speed difference isn't noticeable for smaller 
>> inputs, and for inputs fewer than 7 digits, most likely it would be slower 
>> than gawk -M. 
>> I tried maximizing portability of the proof-of-concept function by 
>> refraining from any gawk-specific extensions of awk - this same code has 
>> also been tested in mawk 1.3.4, mawk 1.9.9.6, and macos 12.1 awk/nawk. 
>> It's entirely self-contained, performs no recursion, has no external 
>> dependencies, no bit-wise ops, and doesn't include any advanced/fancy math - 
>> just straight up grade-school long-form addition, doubling them 15-digits 
>> per chunk, and 2 chunks per while loop. The carrying part is performed by 
>> gsub( ) prior to the while( ) loop, and thus, eliminating the need to track 
>> them along the way. Performance scales linearly at the log-base-10 level.
>> ( I didn't include any copyright notice or credits to a priori since I don't 
>> think grade school addition is something copyrightable) 
>> Obviously this is simply awk scripting code and can't be directly 
>> incorporated into the C code base - I'm merely raising awareness of the 
>> issue.
>> Thanks for your time.
>> Jason
>> ===================================================================================
>> ( I can reformat the code for readability if you prefer )  : 
>> function _2x_(__,_,_______,____,_________,      
>> ___________,__________,____________,              ________,_____,______,___) 
>> {    if(__!~/[1-9]/) {        return +_ }    ___=(__~/^[-]/)    
>> sub("^[+-]?["(-"")"]*","",__)    if 
>> (length(__)<(_____=((_+=++_)^_^_-!!_))+_) {        if 
>> (_________^((_____+(_~____))^(_____-_)<+__)) {            return 
>> (-!-"")^___*(+__+__)    } }    ___=substr("-",_~"",___)    if (__!~"[5-9]") 
>> {        gsub(/4/,"8",__)-gsub(/3/,"6",__)        
>> gsub(/2/,"4",__)-gsub(/1/,"2",__)        return (___)__    }; 
>> _______=____=________="";    
>> __=(_____=substr(________=(_++^_--+_)^_____,_))(__)    gsub(/./,".",_____)   
>>       sub("("(_____)")+$","_&",__)    __=substr(__,index(__,"_")+!+"")    
>> ________+=+________;_="";    if 
>> ((gsub(_____,"&_",__)+gsub(/[_][5-9]/,".5&",__)*(+""))%2) {        
>> _=(":")(________+__+__)        __=substr(__,index(__,"_")+!+"")    }    
>> for(_____ in ______) { }    if (______[_____=split(__,______,/[_]/)]=="") {  
>>       delete ______[_____--] };__=____~____;    __________=____________*=\   
>>  ____________=___________=_________=32;    while(__<_____) { 
>> if(!--__________){        __________=____________;_______=(_______)_;_="";   
>>      if(!--___________){        
>> ___________=_________;____=(____)_______;_______="";        } }        
>> _=(_)(":")(+______[__++]*2+________)\             
>> (":")(+______[__++]*2+________) }    ____=(____)(_______)(_)\         
>> (__==_____?(":")(________+2*______[_____]):"")    gsub(/[:]+[^:]/,"",____)   
>>        sub(/^0*/,"",____); return (___)____ }



reply via email to

[Prev in Thread] Current Thread [Next in Thread]