Re: [Lightning] More on work on lightning

From: Paulo César Pereira de Andrade
Subject: Re: [Lightning] More on work on lightning
Date: Mon, 27 Sep 2010 17:53:56 -0300

On 27 September 2010 06:41, Paolo Bonzini <address@hidden> wrote:
> On 09/26/2010 01:15 PM, Paulo César Pereira de Andrade wrote:
>> On 26 September 2010 04:22, Paolo Bonzini <address@hidden> wrote:
>>> 2010/9/25 Paulo César Pereira de Andrade
>>> <address@hidden>:
>>>>  How to get forward/context information?
>>>> 2. Add a standard field to jit_state_t or jit_local_state to be filled
>>>>   by the programmer
>>> That's possible.  However, I think this does not belong in lightning
>>> at all.  lightning users could do inlining at a high level to ensure
>>> big enough subroutines are generated and the prolog overhead is not
>>> important.  Register allocation could be done at a higher level too,
>>> and so could constant propagation.
>>   What do you suggest for "exporting" the 6 gpr argument registers and
>> the 8 xmm argument registers on x86_64? I mean, to do it through a
>> somewhat standard interface. One could just use the registers,
>> but maybe add a new JIT_ARG(n) and JIT_FPARG(n), plus
>> JIT_ARG_NUM and JIT_FPARG_NUM to be tested for availability?
> What do you mean?
> Arguments should be pushed without knowledge of argument registers, even
> though that's not perfect when you have a register allocator that can do
> coalescing.

  I was thinking of it more in the sense of a function that receives
fewer arguments than can be passed in registers, and whose body does not
otherwise use the argument registers. And if an argument is known to be
no longer required, its register could also be used for other purposes.

>>   Since the i386 code always did a "sub 12,%esp", I converted it into
>> an explicit 32 bit immediate (that is, do not call SUBLir, but inline the
>> instruction generation so it does not use the 8 bit immediate version),
>> and patch the value on the fly if more space is needed, and then pass
>> arguments using jit_stxi_x from JIT_SP.
> Nice.  However, it does not work if you jump from one function to another
> (skipping the jit_prolog of the latter) to do tail-calling.
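
  The patching scheme quoted above could be sketched roughly as below.
This is only an illustration, not lightning code: emitter_t, emit_prolog_sub,
reserve_stack and frame_size are made-up names, and the code buffer is a
plain byte array. It relies on the long form of the instruction
(81 EC imm32) being emitted even when the value would fit the short
form (83 EC imm8), so the immediate can be rewritten later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical code buffer; lightning's real state is not modeled here. */
typedef struct {
    uint8_t buf[64];
    size_t  len;        /* bytes emitted so far */
    size_t  frame_imm;  /* offset of the imm32 of "sub $imm32,%esp" */
} emitter_t;

/* Store/load a 32 bit value in x86 (little-endian) byte order. */
static void put_u32le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

static uint32_t get_u32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Emit "sub $size,%esp" always in the 6 byte form 81 EC imm32 and
 * remember where the immediate lives so it can be patched later. */
void emit_prolog_sub(emitter_t *e, uint32_t size)
{
    assert(e->len + 6 <= sizeof e->buf);
    e->buf[e->len++] = 0x81;    /* opcode: group 1, imm32 operand */
    e->buf[e->len++] = 0xEC;    /* ModRM: /5 (sub), register %esp */
    e->frame_imm = e->len;
    put_u32le(e->buf + e->len, size);
    e->len += 4;
}

/* Current stack adjustment recorded in the already emitted prolog. */
uint32_t frame_size(const emitter_t *e)
{
    return get_u32le(e->buf + e->frame_imm);
}

/* Grow the frame in place if a later call needs more stack space. */
void reserve_stack(emitter_t *e, uint32_t needed)
{
    if (needed > frame_size(e))
        put_u32le(e->buf + e->frame_imm, needed);
}
```

  As noted in the quoted reply, this only holds while every path into the
function body goes through the patched prolog.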

  I was thinking a bit about this. One problem is that there is no defined
function boundary: besides the rule that jit_prolog must be the first
instruction, there may be multiple calls to jit_ret, which may not be cheap.
Maybe there should be a jit_epilogue, but the user can just jump to the ret
manually.

  Doing special handling after jit_leaf(), possibly just like before, could
be an alternative. The state would be set in jit_prolog or jit_leaf.

  But if code is jumping from function to function, it would be better
to know the stack layout of the function it is jumping to. For example,
tail calling usually needs some kind of accumulator, either in a
register or on the stack, or possibly changes the argument itself and
jumps to the start of the function. I mean tail call as in converting
a recursive function into a loop, like:
fact(n) { if (n > 1) return fact(n-1)*n; return 1; }
fact(n) { m = 1; restart: if (n > 1) { m*=n; --n; goto restart; } return m; }
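
  A compilable version of the two forms, as a minimal sketch: the names
fact_rec and fact_loop and the uint64_t result type are my own choices
here, and the n <= 20 assertion is only there because 21! no longer fits
in 64 bits. Note that fact(n-1)*n is not itself in tail position (the
multiply happens after the call returns), which is why the loop form
needs the accumulator m.

```c
#include <assert.h>
#include <stdint.h>

/* Recursive form: the multiplication runs after the recursive call. */
uint64_t fact_rec(unsigned n)
{
    assert(n <= 20);            /* 20! is the largest that fits in 64 bits */
    if (n > 1)
        return fact_rec(n - 1) * n;
    return 1;
}

/* Loop form: the pending multiplication is moved into the accumulator m,
 * the argument is changed in place, and the "call" becomes a jump back
 * to the start of the function body. */
uint64_t fact_loop(unsigned n)
{
    uint64_t m = 1;

    assert(n <= 20);
restart:
    if (n > 1) {
        m *= n;
        --n;
        goto restart;
    }
    return m;
}
```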

  Usually, there should only be issues in code like this (in pseudo assembler):

    jmp main
    prolog 0
    jmpi error_handler

In this example, prolog was never called before the error_handler
label, but if a function had been defined before it, error_handler would
use that function's stack layout, not main's.

>>   I also updated some code for float/double conversion/load to
>> make a jit_allocai call "on demand", and then, use jit_{ld,st}xi_x
>> from JIT_FP.
> Have you timed performance?  Stack operations are really really cheap on
> x86.

  I just timed it on 3 different computers, and there is no difference. The
timings of some loops running 10 million times oscillate 1-5 percent up or
down, so effectively it is the same; the new approach should just
make for a prettier assembly dump.

>>   That would be an option, probably a very good one, but probably
>> would break things badly because the same shared object needs
>> to call gmp/mpfr/X11/etc functions.
> regparm uses callee-saved registers for parameters and can be applied
> per-function.  It's not "-ffixed-xxx" which might change the ABI. However,
> there is a problem in that prepare/pusharg/finish does not understand the
> regparm calling convention.

  Ok. But for the next actual work on using lightning in my language, it
would probably be better to have good jump table logic, to somewhat
inline switches, and even better if it could use a scheme like the one in
the vm I threw away to work on lightning (the current code using lightning
is still more than 2 times slower...). The concept was that it used several
different "dispatch tables", so that it would only really check the
"implicit value type" when loading a new variable, and when the type
changed, it would jump to the proper table. A small sample of how it used to be:

int_table[] = { ..., &&iadd, &&isub, ... };
float_table[] = { ..., &&fadd, &&fsub, ... };
    if (arg.t == t_float) {
        value.f = value.i + arg.f;
        goto *float_table[*pc++];
    }
    tmp = value.i + arg.i;
    if (overflow) {
        value.z = make_bignum(value.i, arg.i);
        goto *mpz_table[*pc++];
    }
    value.i = tmp;
    goto *int_table[*pc++];

but it was already on average 10+ times slower than equivalent gcc -O0
code when the above was only using integers, and adding better
support for more basic types just made the above logic far too
complex, with combinations of {,u}int{8,16,32,64} and float{32,64};
and for statically typed variables, inline jit would be better
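
  For reference, a self-contained miniature of the dispatch table idea,
assuming the GCC/Clang "labels as values" extension. It is not the
original vm: the opcodes, the cell_t layout and the run() entry point are
invented for illustration, and the overflow/bignum path is left out
(integer overflow is simply not handled here). The point it shows is that
the accumulator's type is implied by which table is active, and the type
tag is only written when execution moves from one table to the other.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_ADD, OP_SUB, OP_HALT };   /* hypothetical opcodes */
enum { T_INT, T_FLOAT };            /* hypothetical type tags */

typedef struct {
    int t;                          /* T_INT or T_FLOAT */
    union { long i; double f; } v;
} cell_t;

/* Runs the bytecode at pc; each ADD/SUB consumes one cell from arg.
 * The accumulator starts as the integer 0. */
cell_t run(const unsigned char *pc, const cell_t *arg)
{
    static void *int_table[]   = { &&iadd, &&isub, &&ihalt };
    static void *float_table[] = { &&fadd, &&fsub, &&fhalt };
    cell_t value;

    assert(pc != NULL);
    value.t = T_INT;
    value.v.i = 0;
    goto *int_table[*pc++];

iadd:
    if (arg->t == T_FLOAT) {        /* type changes: switch tables */
        value.v.f = (double)value.v.i + arg->v.f;
        value.t = T_FLOAT;
        arg++;
        goto *float_table[*pc++];
    }
    value.v.i += arg->v.i;
    arg++;
    goto *int_table[*pc++];

isub:
    if (arg->t == T_FLOAT) {
        value.v.f = (double)value.v.i - arg->v.f;
        value.t = T_FLOAT;
        arg++;
        goto *float_table[*pc++];
    }
    value.v.i -= arg->v.i;
    arg++;
    goto *int_table[*pc++];

ihalt:
    return value;

fadd:                               /* accumulator known to be float here */
    value.v.f += (arg->t == T_FLOAT) ? arg->v.f : (double)arg->v.i;
    arg++;
    goto *float_table[*pc++];

fsub:
    value.v.f -= (arg->t == T_FLOAT) ? arg->v.f : (double)arg->v.i;
    arg++;
    goto *float_table[*pc++];

fhalt:
    return value;
}
```

  Every extra value type adds another table plus transitions between tables
in each handler, which is where the combinatorial growth mentioned above
comes from.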

>>   My interest in adding those was because they were just #if 0'ed, but
>> having a single opcode, possibly with some small tests, is tempting, as
>> it also means registers are saved. But the cost is very high; it averages
>> well over 100 cycles for sin/cos and some others. Add the transfer through
>> the stack and it becomes more costly. I wonder if there will be reliable
>> trigonometric or transcendental functions on x86+sse, as sse I believe (and
>> actually the name implies :-) is not really meant for scientific
>> programming, but for multimedia.
> No, that's not true anymore.  SSE is just floating-point math done right.
> There are no trig/transcendental functions because it doesn't really make
> much sense anymore in modern microarchitectures, and doesn't guarantee
> correct results so it's tricky to use it.
> In code where performance really matters the compiler could do better by
> vectorizing the loop, and calling functions doing a vector sine/cosine/log/exp.
> However, where hardware helps is with reciprocal and square root, so there
> are instructions for that in SSE.

  Ok. If I could keep some state in xmm registers, it would help a lot,
but AFAIK there are no callee-saved xmm registers in the System V ABI. That
would also make 128 bit integer registers available on 32 bits...

> Paolo

