bug-gawk
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely la


From: Andrew J. Schorr
Subject: Re: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs
Date: Thu, 27 Jan 2022 12:27:09 -0500
User-agent: Mutt/1.5.21 (2010-09-15)

Gosh, I'm not trying to insult you. If that's how you perceived my response,
then I apologize. Working in the open-source community is hard. When I find
problems with open-source software, it often takes me a great deal of time
to isolate the problem, report it properly, and pursue a resolution
with the maintainers.

The code you sent is just impossible to comprehend. It makes it very hard
to eyeball the problem and troubleshoot it. I'm sure we're all happy 
to learn about bugs and performance problems to see if they can be
addressed or ameliorated, but the maintainers are not getting paid to
do this work. We all have day jobs and other demands on our time.

By all means, we thank you for identifying a potential problem. But please
help us to understand what the problem is by sharing some code that we
have half a chance of understanding. I didn't say that would be easy for
you. But nothing about maintaining and improving open-source software
is easy. That's the nature of the beast.

Also, you might try profiling the code to see where it's spending its time.

Regards,
Andy

On Thu, Jan 27, 2022 at 12:18:08PM -0500, Jason C. Kwan wrote:
> I’ll go modify the code for your convenience , but I didn’t make the code 
> that way to give u a hard time - I write code like that for my personal use 
> too. 
> 
> I don’t know how to make that function more concise and still process 
> doubling 1-billion-digit integers with same precision as gmp.
> 
> did you think I had Fun and joy taking time out to hash out a fully working 
> proof of concept to illustrate my point about the performance difference , in 
> hopes of starting a candid discussion instead of just saying “it’s slow” and 
> sound like generic whining ? 
> 
> the proof of concept function is designed to be ran *outside* of gmp mode -
> 
> You want gmp gawk code it’s 
> 
> gawk -Mbe ‘{ print $0+$0 }’
> 
> that’s it - the point of proof of concept isn’t to illustrate the performance 
> of gmp itself - it serves as simply an example of how it’s feasible to use 
> base gawk without gmp bignum mode, perform certain tasks with same accuracy , 
> at much faster speeds. 
> 
> I wasn’t saying  “incorporate my code” then giving you that - it’s to raise 
> awareness of how come it’s even feasible , with all the overhead associated 
> with interpreted scripting, to perform the same task significantly faster 
> than natively built in ones.  Interpreted scripting, by definition, should be 
> constrained by the performance of natively built in features. I’m not talking 
> about a 5% 10% margin of error thing - I wouldn’t even say anything and risk 
> being labeled as a time waster if it’s a small double digit gain.
> 
>  It’s somewhat paradoxical to have the scripts go 275-400% speed of the 
> underlying binary. 
> 
> I perfectly believe both gawk and gmp teams are giving us state of the art 
> software, that’s why I’m guessing it maybe a I/O issue between gawk and gmp 
> that might be unnecessarily slowing things down. 
> 
> but don’t worry about the next time - if I see something, I won’t waste your 
> precious time by saying something. 
> 
> Regards,
> Jason K
> 
> > Andrew J. Schorr <aschorr@telemetry-investments.com>於2022年1月27日 08:24寫道:
> > 
> > Hi,
> > 
> > As Arnold subtly pointed out, this is a bug reporting list, not a forum for
> > discussing obfuscated code challenges.  If you believe there's a gawk
> > performance problem, please provide a short, comprehensible program that
> > demonstrates the issue.
> > 
> > Thanks,
> > Andy
> > 
> >> On Thu, Jan 27, 2022 at 08:13:58AM -0500, Jason C. Kwan wrote:
> >> Have you tried even just straight up pasting it and run it as is ? when 
> >> interpreted scripting for the same platform is generating some 65% 
> >> reduction in execution time versus the built in compiled binary from C 
> >> code source, don’t you think it should at least warrant a quick look to 
> >> see if there are potential I/O bottlenecks between gawk and the GMP add-on 
> >> module it uses ?
> >> 
> >> i already used LC_ALL=C locale, per the bug reporting page, as you could 
> >> see below. I even let gawk -M go second so if there were any gains from 
> >> caching, it would be gawk -M that benefited from it. Maybe it’s an ARM 
> >> thing, being that I’m on m1max instead of x64
> >> 
> >> I only sent here first because I’m somewhat certain the GMP side would 
> >> instantly send me back here, seeing that I couldn’t isolate out whether 
> >> the potential bottlenecks exists on gawk side or gmp side, and I haven’t 
> >> written in C since 2002. I even tried reading the gawk 5.1.1 source code 
> >> to see if I could identify anything that could be help, but haven’t been 
> >> lucky.
> >> 
> >> Do u want me to read through the gmp source first ?
> >> 
> >> Regards,
> >> Jason K
> >> 
> >>> arnold@skeeve.com於2022年1月27日 01:42寫道:
> >>> 
> >>> Hello.
> >>> 
> >>> Thank you for taking the time to report an issue.
> >>> 
> >>> Please read the instructions for bug reporting at
> >>> https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
> >>> It was updated recently, please reread it if you haven't looked at
> >>> it in a long time.
> >>> 
> >>> I'm afraid that your "proof of concept" code is so unreadable that
> >>> I won't even try to determine what it's doing.
> >>> 
> >>> With respect to GMP performance, I suggest that you report an
> >>> issue directly to the GMP developers.  The gawk code that uses
> >>> GMP isn't going to change.  For this reason, I don't think you
> >>> need to bother redoing your example code, unless you want to send
> >>> it to the GMP developers.
> >>> 
> >>> Thanks,
> >>> 
> >>> Arnold
> >>> 
> >>> "Jason C. Kwan" via "Bug reports only for gawk." <bug-gawk@gnu.org> wrote:
> >>> 
> >>>> Hi GAWK team,
> >>>> Not sure if I should be reporting this to the gawk team or the GnuMP 
> >>>> team - it's neither a bug or nor new feature per se, but merely a 
> >>>> performance one - I've noticed on that for extremely large inputs, say, 
> >>>> integers with more than 5 million digits, the bignum gawk -M mode could 
> >>>> be somewhat slow, with possible room for speed improvement 
> >>>> (proof-of-concept attached below). my gawk version information : 
> >>>> GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
> >>>> Darwin MacBook-Pro.local 21.2.0 Darwin Kernel Version 21.2.0: Sun Nov 28 
> >>>> 20:28:41 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T6000 arm64
> >>>> On my test-case of 71.5 million digits, it's possible to to save 63.7% 
> >>>> of time , even in regular gawk, when compared to gawk -M  :
> >>>> ==================
> >>>> gawk -be '{ print length($0) }' test.txt71591842
> >>>> test command ::
> >>>> echo; time ( pv -q < test.txt | LC_ALL=C gawkmx -b -e '{ print _2x_($0) 
> >>>> }' ) | xxh128sum ; echo; time (pv -q < test.txt | LC_ALL=C gawk -M -b -e 
> >>>> '{ print $0+$0 }' ) | xxh128sum ; echo
> >>>> a56c2d2302d9ea8751f810b848e6f354  stdin( pv -q < test.txt | LC_ALL=C 
> >>>> LC_ALL=C gawk -e "${mfx}" -b) 8.16s user 0.60s system 99% cpu 8.789 
> >>>> totalxxh128sum  0.01s user 0.00s system 0% cpu 8.789 total
> >>>> a56c2d2302d9ea8751f810b848e6f354  stdin( pv -q < test.txt | LC_ALL=C 
> >>>> gawk -M -b -e ; )  23.26s user 0.96s system 99% cpu 24.240 
> >>>> totalxxh128sum  0.01s user 0.00s system 0% cpu 24.239 total
> >>>> =====================
> >>>> In another test case of slightly over 275 milion digits, the time 
> >>>> savings are 71.7% : 
> >>>> fc2231bdff375b7870586d8dffc0841c  stdin( pv -q < 
> >>>> jwengowengonoewgnwoegn.txt | LC_ALL=C gawk -b -e "${mfx}" -e ; )  27.25s 
> >>>> user 6.48s system 98% cpu 34.350 totalxxh128sum  0.04s user 0.02s system 
> >>>> 0% cpu 34.349 total
> >>>> fc2231bdff375b7870586d8dffc0841c  stdin( pv -q < 
> >>>> jwengowengonoewgnwoegn.txt | LC_ALL=C gawk -M -b -e ; )  116.58s user 
> >>>> 4.78s system 99% cpu 2:01.42 totalxxh128sum  0.04s user 0.02s system 0% 
> >>>> cpu 2:01.42 total
> >>>> ====================
> >>>> Attached below is the full proof-of-concept code for function _2x_( ) to 
> >>>> demostrate that the time savings are very much concrete and possible, 
> >>>> not simply theoretical. The test file, being 70MB+, is a big large for 
> >>>> email, but basically any file using ASCII digits 0-9 to represent any 
> >>>> integer over 5 million digits will do. The speed difference isn't 
> >>>> noticeable for smaller inputs, and for inputs fewer than 7 digits, most 
> >>>> likely it would be slower than gawk -M. 
> >>>> I tried maximizing portability of the proof-of-concept function by 
> >>>> refraining from any gawk-specific extensions of awk - this same code has 
> >>>> also been tested in mawk 1.3.4, mawk 1.9.9.6, and macos 12.1 awk/nawk. 
> >>>> It's entirely self-contained, performs no recursion, has no external 
> >>>> dependencies, no bit-wise ops, and doesn't include any advanced/fancy 
> >>>> math - just straight up grade-school long-form addition, doubling them 
> >>>> 15-digits per chunk, and 2 chunks per while loop. The carrying part is 
> >>>> performed by gsub( ) prior to the while( ) loop, and thus, eliminating 
> >>>> the need to track them along the way. Performance scales linearly at the 
> >>>> log-base-10 level.
> >>>> ( I didn't include any copyright notice or credits to a priori since I 
> >>>> don't think grade school addition is something copyrightable)
> >>>> Obviously this is simply awk scripting code and can't be directly 
> >>>> incorporated into the C code base - I'm merely raising awareness of the 
> >>>> issue.
> >>>> Thanks for your time.
> >>>> Jason
> >>>> ===================================================================================
> >>>> ( I can reformat the code for readability if you prefer )  : 
> >>>> function _2x_(__,_,_______,____,_________,      
> >>>> ___________,__________,____________,              
> >>>> ________,_____,______,___) {    if(__!~/[1-9]/) {        return +_ }    
> >>>> ___=(__~/^[-]/)    sub("^[+-]?["(-"")"]*","",__)    if 
> >>>> (length(__)<(_____=((_+=++_)^_^_-!!_))+_) {        if 
> >>>> (_________^((_____+(_~____))^(_____-_)<+__)) {            return 
> >>>> (-!-"")^___*(+__+__)    } }    ___=substr("-",_~"",___)    if 
> >>>> (__!~"[5-9]") {        gsub(/4/,"8",__)-gsub(/3/,"6",__)        
> >>>> gsub(/2/,"4",__)-gsub(/1/,"2",__)        return (___)__    }; 
> >>>> _______=____=________="";    
> >>>> __=(_____=substr(________=(_++^_--+_)^_____,_))(__)    
> >>>> gsub(/./,".",_____)         sub("("(_____)")+$","_&",__)    
> >>>> __=substr(__,index(__,"_")+!+"")    ________+=+________;_="";    if 
> >>>> ((gsub(_____,"&_",__)+gsub(/[_][5-9]/,".5&",__)*(+""))%2) {        
> >>>> _=(":")(________+__+__)        __=substr(__,index(__,"_")+!+"")    }    
> >>>> for(_____ in ______) { }    if 
> >>>> (______[_____=split(__,______,/[_]/)]=="") {        delete 
> >>>> ______[_____--] };__=____~____;    __________=____________*=\    
> >>>> ____________=___________=_________=32;    while(__<_____) { 
> >>>> if(!--__________){        
> >>>> __________=____________;_______=(_______)_;_="";        
> >>>> if(!--___________){        
> >>>> ___________=_________;____=(____)_______;_______="";        } }        
> >>>> _=(_)(":")(+______[__++]*2+________)\             
> >>>> (":")(+______[__++]*2+________) }    ____=(____)(_______)(_)\         
> >>>> (__==_____?(":")(________+2*______[_____]):"")    
> >>>> gsub(/[:]+[^:]/,"",____)          sub(/^0*/,"",____); return (___)____ }
> >> 
> 

-- 
Andrew Schorr                      e-mail: aschorr@telemetry-investments.com
Telemetry Investments, L.L.C.      phone:  917-305-1748
152 W 36th St, #402                fax:    212-425-5550
New York, NY 10018-8765



reply via email to

[Prev in Thread] Current Thread [Next in Thread]