chicken-users

Re: Performance question concerning chicken flonum vs "foreign flonum"


From: Christian Himpe
Subject: Re: Performance question concerning chicken flonum vs "foreign flonum"
Date: Fri, 05 Nov 2021 22:17:07 +0100 (CET)

felix.winkelmann@bevuta.com wrote on 2021-11-04:
> > 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> > 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
> >
> >[...]
> >
> > It would be great to get some help or explanation with this issue.

> Hi!

> I have similar timings and the difference in the number of minor GC indicates
> that the c99-fma variant allocates more stack space and thus causes more
> minor GCs.

> Looking at the generated C file ("csc -k"), we see that scm-fma unboxes the
> intermediate result and thus generates relatively decent code:

> /* scm-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_187(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> double f0;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_187,c,av);}
> a=C_alloc(4);
> f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
> t5=t1;{
> C_word *av2=av;
> av2[0]=t5;
> av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
> ((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}

> The other version allocates a bytevector to hold the result:

> /* c99-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(6);
> t5=C_a_i_bytevector(&a,1,C_fix(4));
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21(t5,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

> I thought that the allocation of 4 words for the bytevector (which is more
> than needed on a 64-bit machine) makes the difference, but it turns out to be
> negligible. Changing it to 2 and also adjusting the values for
> C_calculate_demand and C_alloc doesn't seem to change a lot, but you may want
> to try that - just modify the C code and compile it with the same options as
> the .scm file.

> On my laptop fma is a library call, so currently my guess is simply that
> the scm-fma code is tighter and avoids 3 additional function calls (one to
> the stub, one to C_a_i_bytevector and one to fma). The increased number of
> GCs may also be caused by the bytevector above, which is used as a
> placeholder for the flonum result, which wastes one word.

> There is room for improvement for the compiler, though: the C_fix(4) is overly
> conservative (4 words are correct on 32-bit, taking care of flonum alignment,
> but unnecessary on 64 bits). Also, the bytevector thing is a bit of a hack - we
> could actually just pass "a" to stub21 directly. You may want to try this out:

> /* c99-fma in k183 in k180 in k177 in k174 (modified) */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(4);
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21((C_word)a,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

> This reduces minor GCs on my machine to roughly the same number as for
> scm-fma. If your compiler inlines stub21 and fma, then you should see
> comparable performance. Also, the default optimization level for C is -Os
> (pass -v to csc to see what is passed to the C compiler), so using -O2
> instead should make a difference.


> felix

Dear Felix,

Thank you for your explanations. I tested your modified source and indeed the
number of GCs is significantly reduced, but the timing difference remains:

original code:

7.656s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.849s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB

modified code:

7.378s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.498s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB

Both were compiled with gcc at optimization level -O3.

I am fine with these results, given your explanation of the internals behind
them.

Would it be conceivable to include such fma functionality directly in
chicken.flonum, e.g. as fp+*, or are the bundled modules typically left
unchanged?

Thank you

Christian


