bug-gawk
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely la


From: Jason C. Kwan
Subject: Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs
Date: Thu, 27 Jan 2022 12:18:08 -0500

I’ll go modify the code for your convenience , but I didn’t make the code that 
way to give u a hard time - I write code like that for my personal use too. 

I don’t know how to make that function more concise and still process doubling 
1-billion-digit integers with same precision as gmp.

did you think I had Fun and joy taking time out to hash out a fully working 
proof of concept to illustrate my point about the performance difference , in 
hopes of starting a candid discussion instead of just saying “it’s slow” and 
sound like generic whining ? 

the proof of concept function is designed to be ran *outside* of gmp mode -

You want gmp gawk code it’s 

gawk -Mbe ‘{ print $0+$0 }’

that’s it - the point of proof of concept isn’t to illustrate the performance 
of gmp itself - it serves as simply an example of how it’s feasible to use base 
gawk without gmp bignum mode, perform certain tasks with same accuracy , at 
much faster speeds. 

I wasn’t saying  “incorporate my code” then giving you that - it’s to raise 
awareness of how come it’s even feasible , with all the overhead associated 
with interpreted scripting, to perform the same task significantly faster than 
natively built in ones.  Interpreted scripting, by definition, should be 
constrained by the performance of natively built in features. I’m not talking 
about a 5% 10% margin of error thing - I wouldn’t even say anything and risk 
being labeled as a time waster if it’s a small double digit gain.

 It’s somewhat paradoxical to have the scripts go 275-400% speed of the 
underlying binary. 

I perfectly believe both gawk and gmp teams are giving us state of the art 
software, that’s why I’m guessing it maybe a I/O issue between gawk and gmp 
that might be unnecessarily slowing things down. 

but don’t worry about the next time - if I see something, I won’t waste your 
precious time by saying something. 

Regards,
Jason K

> Andrew J. Schorr <aschorr@telemetry-investments.com>於2022年1月27日 08:24寫道:
> 
> Hi,
> 
> As Arnold subtly pointed out, this is a bug reporting list, not a forum for
> discussing obfuscated code challenges.  If you believe there's a gawk
> performance problem, please provide a short, comprehensible program that
> demonstrates the issue.
> 
> Thanks,
> Andy
> 
>> On Thu, Jan 27, 2022 at 08:13:58AM -0500, Jason C. Kwan wrote:
>> Have you tried even just straight up pasting it and run it as is ? when 
>> interpreted scripting for the same platform is generating some 65% reduction 
>> in execution time versus the built in compiled binary from C code source, 
>> don’t you think it should at least warrant a quick look to see if there are 
>> potential I/O bottlenecks between gawk and the GMP add-on module it uses ?
>> 
>> i already used LC_ALL=C locale, per the bug reporting page, as you could see 
>> below. I even let gawk -M go second so if there were any gains from caching, 
>> it would be gawk -M that benefited from it. Maybe it’s an ARM thing, being 
>> that I’m on m1max instead of x64
>> 
>> I only sent here first because I’m somewhat certain the GMP side would 
>> instantly send me back here, seeing that I couldn’t isolate out whether the 
>> potential bottlenecks exists on gawk side or gmp side, and I haven’t written 
>> in C since 2002. I even tried reading the gawk 5.1.1 source code to see if I 
>> could identify anything that could be help, but haven’t been lucky.
>> 
>> Do u want me to read through the gmp source first ?
>> 
>> Regards,
>> Jason K
>> 
>>> arnold@skeeve.com於2022年1月27日 01:42寫道:
>>> 
>>> Hello.
>>> 
>>> Thank you for taking the time to report an issue.
>>> 
>>> Please read the instructions for bug reporting at
>>> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
>>> It was updated recently, please reread it if you haven't looked at
>>> it in a long time.
>>> 
>>> I'm afraid that your "proof of concept" code is so unreadable that
>>> I won't even try to determine what it's doing.
>>> 
>>> With respect to GMP performance, I suggest that you report an
>>> issue directly to the GMP developers.  The gawk code that uses
>>> GMP isn't going to change.  For this reason, I don't think you
>>> need to bother redoing your example code, unless you want to send
>>> it to the GMP developers.
>>> 
>>> Thanks,
>>> 
>>> Arnold
>>> 
>>> "Jason C. Kwan" via "Bug reports only for gawk." <bug-gawk@gnu.org> wrote:
>>> 
>>>> Hi GAWK team,
>>>> Not sure if I should be reporting this to the gawk team or the GnuMP team 
>>>> - it's neither a bug or nor new feature per se, but merely a performance 
>>>> one - I've noticed on that for extremely large inputs, say, integers with 
>>>> more than 5 million digits, the bignum gawk -M mode could be somewhat 
>>>> slow, with possible room for speed improvement (proof-of-concept attached 
>>>> below). my gawk version information : 
>>>> GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
>>>> Darwin MacBook-Pro.local 21.2.0 Darwin Kernel Version 21.2.0: Sun Nov 28 
>>>> 20:28:41 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T6000 arm64
>>>> On my test-case of 71.5 million digits, it's possible to to save 63.7% of 
>>>> time , even in regular gawk, when compared to gawk -M  :
>>>> ==================
>>>> gawk -be '{ print length($0) }' test.txt71591842
>>>> test command ::
>>>> echo; time ( pv -q < test.txt | LC_ALL=C gawkmx -b -e '{ print _2x_($0) }' 
>>>> ) | xxh128sum ; echo; time (pv -q < test.txt | LC_ALL=C gawk -M -b -e '{ 
>>>> print $0+$0 }' ) | xxh128sum ; echo
>>>> a56c2d2302d9ea8751f810b848e6f354  stdin( pv -q < test.txt | LC_ALL=C 
>>>> LC_ALL=C gawk -e "${mfx}" -b) 8.16s user 0.60s system 99% cpu 8.789 
>>>> totalxxh128sum  0.01s user 0.00s system 0% cpu 8.789 total
>>>> a56c2d2302d9ea8751f810b848e6f354  stdin( pv -q < test.txt | LC_ALL=C gawk 
>>>> -M -b -e ; )  23.26s user 0.96s system 99% cpu 24.240 totalxxh128sum  
>>>> 0.01s user 0.00s system 0% cpu 24.239 total
>>>> =====================
>>>> In another test case of slightly over 275 milion digits, the time savings 
>>>> are 71.7% : 
>>>> fc2231bdff375b7870586d8dffc0841c  stdin( pv -q < 
>>>> jwengowengonoewgnwoegn.txt | LC_ALL=C gawk -b -e "${mfx}" -e ; )  27.25s 
>>>> user 6.48s system 98% cpu 34.350 totalxxh128sum  0.04s user 0.02s system 
>>>> 0% cpu 34.349 total
>>>> fc2231bdff375b7870586d8dffc0841c  stdin( pv -q < 
>>>> jwengowengonoewgnwoegn.txt | LC_ALL=C gawk -M -b -e ; )  116.58s user 
>>>> 4.78s system 99% cpu 2:01.42 totalxxh128sum  0.04s user 0.02s system 0% 
>>>> cpu 2:01.42 total
>>>> ====================
>>>> Attached below is the full proof-of-concept code for function _2x_( ) to 
>>>> demostrate that the time savings are very much concrete and possible, not 
>>>> simply theoretical. The test file, being 70MB+, is a big large for email, 
>>>> but basically any file using ASCII digits 0-9 to represent any integer 
>>>> over 5 million digits will do. The speed difference isn't noticeable for 
>>>> smaller inputs, and for inputs fewer than 7 digits, most likely it would 
>>>> be slower than gawk -M. 
>>>> I tried maximizing portability of the proof-of-concept function by 
>>>> refraining from any gawk-specific extensions of awk - this same code has 
>>>> also been tested in mawk 1.3.4, mawk 1.9.9.6, and macos 12.1 awk/nawk. 
>>>> It's entirely self-contained, performs no recursion, has no external 
>>>> dependencies, no bit-wise ops, and doesn't include any advanced/fancy math 
>>>> - just straight up grade-school long-form addition, doubling them 
>>>> 15-digits per chunk, and 2 chunks per while loop. The carrying part is 
>>>> performed by gsub( ) prior to the while( ) loop, and thus, eliminating the 
>>>> need to track them along the way. Performance scales linearly at the 
>>>> log-base-10 level.
>>>> ( I didn't include any copyright notice or credits to a priori since I 
>>>> don't think grade school addition is something copyrightable)
>>>> Obviously this is simply awk scripting code and can't be directly 
>>>> incorporated into the C code base - I'm merely raising awareness of the 
>>>> issue.
>>>> Thanks for your time.
>>>> Jason
>>>> ===================================================================================
>>>> ( I can reformat the code for readability if you prefer )  : 
>>>> function _2x_(__,_,_______,____,_________,      
>>>> ___________,__________,____________,              
>>>> ________,_____,______,___) {    if(__!~/[1-9]/) {        return +_ }    
>>>> ___=(__~/^[-]/)    sub("^[+-]?["(-"")"]*","",__)    if 
>>>> (length(__)<(_____=((_+=++_)^_^_-!!_))+_) {        if 
>>>> (_________^((_____+(_~____))^(_____-_)<+__)) {            return 
>>>> (-!-"")^___*(+__+__)    } }    ___=substr("-",_~"",___)    if 
>>>> (__!~"[5-9]") {        gsub(/4/,"8",__)-gsub(/3/,"6",__)        
>>>> gsub(/2/,"4",__)-gsub(/1/,"2",__)        return (___)__    }; 
>>>> _______=____=________="";    
>>>> __=(_____=substr(________=(_++^_--+_)^_____,_))(__)    gsub(/./,".",_____) 
>>>>         sub("("(_____)")+$","_&",__)    __=substr(__,index(__,"_")+!+"")   
>>>>  ________+=+________;_="";    if 
>>>> ((gsub(_____,"&_",__)+gsub(/[_][5-9]/,".5&",__)*(+""))%2) {        
>>>> _=(":")(________+__+__)        __=substr(__,index(__,"_")+!+"")    }    
>>>> for(_____ in ______) { }    if (______[_____=split(__,______,/[_]/)]=="") 
>>>> {        delete ______[_____--] };__=____~____;    
>>>> __________=____________*=\    ____________=___________=_________=32;    
>>>> while(__<_____) { if(!--__________){        
>>>> __________=____________;_______=(_______)_;_="";        
>>>> if(!--___________){        
>>>> ___________=_________;____=(____)_______;_______="";        } }        
>>>> _=(_)(":")(+______[__++]*2+________)\             
>>>> (":")(+______[__++]*2+________) }    ____=(____)(_______)(_)\         
>>>> (__==_____?(":")(________+2*______[_____]):"")    gsub(/[:]+[^:]/,"",____) 
>>>>          sub(/^0*/,"",____); return (___)____ }
>> 



reply via email to

[Prev in Thread] Current Thread [Next in Thread]