Re: [Lightning] About jit_roundr_d_i
From: Paulo César Pereira de Andrade
Subject: Re: [Lightning] About jit_roundr_d_i
Date: Fri, 10 Sep 2010 09:53:56 -0300
On 10 September 2010 04:48, Paolo Bonzini <address@hidden> wrote:
> On 09/10/2010 12:21 AM, Paulo César Pereira de Andrade wrote:
>>
>> The default implementation, assuming it follows the
>> "round" definition and the default (round to nearest) rounding mode,
>> is actually wrong because of ties. The problem is that on
>> ties, "round" should round away from zero, but there is
>> no such rounding mode, only the default, towards zero.
>
> True. Let's just document that round is the same as the C function rint.
> (Though, shouldn't it round to even?)
Yes, the round to nearest mode is round to even on ties. I am making
a more complete test case covering this...
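To make the tie behaviour concrete, here is a minimal C sketch (not lightning code) comparing the two rules. The names nearest_even and nearest_away are made up for illustration, and nearest_even assumes IEEE 754 doubles with the default round-to-nearest mode in effect:

```c
#include <assert.h>

/* Round to nearest, ties to even: what rint() gives in the default
 * rounding mode. Adding and then subtracting 1.5 * 2^52 lets the FPU
 * do the rounding itself; volatile stops the compiler from folding
 * the two operations away. Hypothetical helper, for illustration. */
static int nearest_even(double x)
{
    assert(x > -2147483648.0 && x < 2147483647.0);
    volatile double t = x + 6755399441055744.0;  /* 1.5 * 2^52 */
    t -= 6755399441055744.0;
    return (int)t;
}

/* Round to nearest, ties away from zero: what C's round() gives.
 * Truncate first, then look at the (exactly computed) fraction. */
static int nearest_away(double x)
{
    assert(x > -2147483648.0 && x < 2147483647.0);
    int i = (int)x;              /* a C cast truncates towards zero */
    double frac = x - i;         /* exact for values in int range */
    if (frac >= 0.5)
        i += 1;
    else if (frac <= -0.5)
        i -= 1;
    return i;
}
```

The two only disagree on exact ties: 0.5 and 2.5 go to 0 and 2 with ties to even, but to 1 and 3 with ties away from zero.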
>> BTW, is this really correct?
>> extr_d_f o1 o2<- convert float o2 to double o1
>> extr_i_d o1 o2<- convert int o2 to double o1
>>
>> If extr_i_d means "int to double", then extr_d_f
>> should mean "double to float", and not "float to double"
>
> Which it does:
>
> #define jit_extr_d_f(rd, rs) CVTSD2SSrr((rs), (rd))
>
> SD2SS = scalar double to scalar float.
>
> But for x87 they're all dummy.
On x86_64 I need to call jit_extr_d_f(JIT_FPR(x), JIT_FPR(y))
before passing a value as a printf argument if it is a float;
otherwise, nothing is required.
>> (there is only FISTTP, no FISTT, so, need to
>> load the value, and pop it, instead of using
>> FXCH, but it should still be cheap as it does,
>> correctly, rounding towards zero on ties)
>
> FISTTP is not on all processors though. At this point, it's easier to use
> SSE(2) for 32-bit too.
I started to write some more code to handle x87. I took some
measurements with this test tool script:
-%<-
.code 256
prolog 0
movi_d %f0 -0.5
movi_i %r0 10000000
loop:
ceilr_d_i %r1 %f0
subi_i %r0 %r0 1
bgei_i loop %r0 0
ret
-%<-
replacing ceil with the conversion being tested. My computer
is below average, but I still measured timings like these:
- 0.05-0.09: a new implementation that also uses FCOMI (P6 or newer),
which does the dirty job of loading the FPU status word and setting
the zero flag, carry flag or parity flag.
- 0.09-0.12: the current code, with conversion to an inline function
as the only change.
- 0.25-0.28: a fully safe but very costly version that loads the control
word, sets the desired rounding mode, and restores it on exit; it
should still be a lot faster than a function call to libm's version.
I have not pushed the change yet because I am still working on a
proper jit_roundr_x_y. For now, and only in my local copy, I aliased
a new jit_rintr_x_y to the previous jit_roundr_x_y.
Here is a cut&paste of a small chunk to give an idea; it also shows
the start of the logic for runtime configurability:
-%<-
__jit_inline void
_i386_floorr_d_i(jit_gpr_t r0, int f0)
{
    jit_gpr_t aux = r0 == _RDX ? _RAX : _RDX;
    PUSHLr(aux);
    FLDr(f0);
    SUBLir(8, _RSP);
    FISTLm(0, _RSP, 0, 0);      /* *esp = (int)*st */
    FILDLm(0, _RSP, 0, 0);      /* *--st = (float)(int)*esp */
    FSUBRPr(1);                 /* st[1] = st[1] - *st, ++st */
    FSTPSm(4, _RSP, 0, 0);      /* *(esp + 4) = (float)*st, ++st */
    POPLr(r0);                  /* r0 = rint(f0) */
    POPLr(aux);                 /* aux = f0 - rint(f0) */
    ADDLir(0x7FFFFFFF, aux);    /* carry if aux < 0 */
    SBBLir(0, r0);              /* subtract 1 if carry */
    POPLr(aux);
}

__jit_inline void
_i686_floorr_d_i(jit_gpr_t r0, int f0)
{
    /* make room for conversion */
    PUSHLr(_RAX);
    /* push value to x87 stack */
    FLDr(f0);
    /* round st(0) to nearest integer */
    FRNDINT_();
    /* compare st(0) to st(f0 + 1) and set eflags */
    FCOMIr(f0 + 1);
    /* store converted integer to stack and pop x87 one */
    FISTPLm(0, _RSP, 0, 0);
    /* pop converted integer */
    POPLr(r0);
    /* subtract 1 if carry */
    SBBLir(0, r0);
}

#define jit_floorr_d_i(r0, f0) jit_floorr_d_i(r0, f0)
__jit_inline void
jit_floorr_d_i(jit_gpr_t r0, int f0)
{
    if (jit_always_round_to_nearest()) {
        if (jit_i686())
            _i686_floorr_d_i(r0, f0);
        else
            _i386_floorr_d_i(r0, f0);
    }
    else
        _safe_floor_d_i(r0, f0);
}
-%<-
_safe_floor_d_i is about the size of the entire chunk above,
so I am not pasting it, to save space :-)
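The idea behind the round-to-nearest paths can be sketched in plain C, which may be easier to follow than the x87 sequence: round to nearest first, then subtract 1 if the rounding went up. This is only a model of the logic, not the generated code. The name floor_via_nearest is hypothetical, and the sketch assumes IEEE 754 doubles with the default round-to-nearest mode in effect:

```c
#include <assert.h>

/* floor(x) built on top of round-to-nearest (illustrative helper).
 * The add/subtract of 1.5 * 2^52 plays the role of the rint-like
 * conversion in the default rounding mode; the comparison plays the
 * role of the sign test followed by "subtract 1 if carry". */
static int floor_via_nearest(double x)
{
    assert(x > -2147483648.0 && x < 2147483647.0);
    volatile double r = x + 6755399441055744.0;  /* 1.5 * 2^52 */
    r -= 6755399441055744.0;     /* r = x rounded to nearest even */
    int i = (int)r;
    if (r > x)                   /* rounded up, so x - r is negative */
        i -= 1;                  /* step back down to the floor */
    return i;
}
```

For example, -0.5 rounds to 0 on the tie, which is above -0.5, so the result is corrected to -1; 2.5 rounds to 2, which is already the floor.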
> Paolo
Paulo