
Re: [Lightning] More on work on lightning


From: Paulo César Pereira de Andrade
Subject: Re: [Lightning] More on work on lightning
Date: Sun, 26 Sep 2010 08:15:41 -0300

On 26 September 2010 04:22, Paolo Bonzini <address@hidden> wrote:
> 2010/9/25 Paulo César Pereira de Andrade
> <address@hidden>:
>>  How to get forward/context information?
>> 2. Add a standard field to jit_state_t or jit_local_state to be filled
>>   by the programmer
>
> That's possible.  However, I think this does not belong in lightning
> at all.  lightning users could do inlining at a high level to ensure
> big enough subroutines are generated and the prolog overhead is not
> important.  Register allocation could be done at a higher level too,
> and so could constant propagation.

  What would you suggest for "exporting" the 6 gpr argument registers
and the 8 xmm argument registers on x86_64? I mean, doing it through a
somewhat standard interface. One could just use the registers
directly, but maybe add new JIT_ARG(n) and JIT_FPARG(n) macros, plus
JIT_ARG_NUM and JIT_FPARG_NUM to be tested for availability?
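
  As a rough sketch of what such an interface could look like (none of
this exists in lightning; the X86_* enum, the jit_arg_regs table and the
arg_register_or_stack helper are made up for illustration, and only the
System V AMD64 argument order rdi, rsi, rdx, rcx, r8, r9 and xmm0..xmm7
is taken as given):

```c
#include <assert.h>

/* Hypothetical register ids: the hardware encodings of the x86_64 GPRs
 * that the System V AMD64 ABI uses for integer arguments. */
enum {
    X86_RCX = 1, X86_RDX = 2, X86_RSI = 6, X86_RDI = 7,
    X86_R8  = 8, X86_R9  = 9
};

/* Proposed availability tests; an i386 backend would define both as 0. */
#define JIT_ARG_NUM   6
#define JIT_FPARG_NUM 8

/* Integer argument registers, in System V AMD64 order. */
static const int jit_arg_regs[JIT_ARG_NUM] = {
    X86_RDI, X86_RSI, X86_RDX, X86_RCX, X86_R8, X86_R9
};

/* Proposed accessors: n must be below the matching *_NUM macro. */
#define JIT_ARG(n)   (jit_arg_regs[(n)])
#define JIT_FPARG(n) (n)              /* xmm0 .. xmm7 */

/* Hypothetical helper: register id for the n-th integer argument,
 * or -1 when the argument has to be passed on the stack. */
static int arg_register_or_stack(int n)
{
    assert(n >= 0);
    return n < JIT_ARG_NUM ? JIT_ARG(n) : -1;
}
```

  A client would then test JIT_ARG_NUM first and fall back to the
portable prepare/pusharg path when it is 0 or when a call has more
arguments than registers.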

>>    printf("%d\n", JIT_R0);
>> =>
>>    subi_l %sp %sp 12
>>    str_i %sp %r0
>>    movi_p %r0 "%d\n"
>>    pushr_l %r0
>>    calli @printf
>>    addi %sp %sp 16
>>
>> it could have allocated stack in prolog for all calls done in
>> the function, and the above could have became:
>>    stxi_i 4 %sp %r0
>>    movi_p %r0 "%d\n"
>>    str_p %r0 %sp
>>    calli @printf
>
> As a code-size efficiency concern I can share it.  But regarding code
> performance, I think there's hardly a difference between the two.
>
> Using the alignment padding more efficiently however is a good idea.

  Actually, this should be a somewhat simple problem :-)

  Since the i386 code always did a "sub 12,%esp", I converted it into
an explicit 32 bit immediate (that is, it does not call SUBLir, but inlines
the instruction generation so that the 8 bit immediate version is not
used), patch the value on the fly if more space is needed, and then pass
arguments using jit_stxi_x from JIT_SP.
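
  To make the patching idea concrete, here is a minimal standalone
sketch. It is not the actual lightning code: emit_sub_esp_imm32 and
reserve_call_stack are hypothetical names, and growing the reservation
in 16 byte steps is my assumption about how to keep the alignment that
the original "sub 12,%esp" provides. The only hard fact used is the
encoding of "sub $imm32,%esp", which is 81 EC followed by the 4 byte
little-endian immediate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store/load a 32 bit value in x86 (little-endian) byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;         p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Emit "sub $imm, %esp" with a full 32 bit immediate, even when imm
 * would fit in 8 bits, and return the address of the immediate so it
 * can be patched later. *pc is advanced past the instruction. */
static uint8_t *emit_sub_esp_imm32(uint8_t **pc, uint32_t imm)
{
    uint8_t *p = *pc;
    uint8_t *patch;

    *p++ = 0x81;               /* group 1 ALU op, 32 bit immediate */
    *p++ = 0xEC;               /* ModRM: mod=11, /5 (sub), rm=%esp */
    patch = p;
    put_le32(p, imm);
    *pc = p + 4;
    return patch;
}

/* Make sure at least `needed` bytes are reserved by the prolog.
 * Assumption: grow in multiples of 16 so the stack alignment set up
 * by the initial value is preserved. */
static void reserve_call_stack(uint8_t *patch, uint32_t needed)
{
    uint32_t cur;

    assert(patch != NULL);
    cur = get_le32(patch);
    if (needed > cur)
        put_le32(patch, cur + ((needed - cur + 15u) & ~15u));
}
```

  With this, a call needing 8 bytes of arguments leaves the initial 12
alone, while one needing 20 bytes patches the prolog to reserve 28.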

  I also updated some code for float/double conversion/load to
make a jit_allocai call "on demand", and then use jit_{ld,st}xi_x
from JIT_FP. push/pop for general purpose registers should be
cheaper and generate less code anyway, so, for now, I only do it
for floats and doubles.
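
  The "on demand" part is just lazy allocation of one scratch slot per
function. A toy model of it, with a stand-in for jit_allocai (the
frame_state struct, frame_alloc and fp_scratch_offset are all
hypothetical and only mimic the idea of handing out negative offsets
from the frame pointer):

```c
#include <assert.h>

/* Hypothetical per-function state; lightning keeps its own. */
typedef struct {
    int frame_size;      /* bytes of locals allocated so far        */
    int fp_scratch_off;  /* 0 = no scratch slot yet, else FP offset */
} frame_state;

/* Stand-in for jit_allocai: reserve `size` bytes (rounded up to 4)
 * and return the slot's offset relative to the frame pointer. */
static int frame_alloc(frame_state *f, int size)
{
    assert(size > 0);
    f->frame_size += (size + 3) & ~3;
    return -f->frame_size;
}

/* Allocate the 8 byte float/double scratch slot on first use only,
 * so functions that never convert or load floats pay nothing. */
static int fp_scratch_offset(frame_state *f)
{
    if (f->fp_scratch_off == 0)
        f->fp_scratch_off = frame_alloc(f, 8);
    return f->fp_scratch_off;
}
```

  Every later conversion in the same function reuses the same offset
for its store/load pair instead of adjusting the stack pointer again.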

  I will try a similar approach on x86_64, but there it may be easier
to keep it as is, and have the current logic as a penalty for using
more than 6 integer and/or 8 float arguments :-)

>> it is even more appealing when there are several sequential
>> function calls (like in my interpreter that currently uses
>> lightning mostly to glue calls to C functions).
>
> Why not use regparm(3) calling convention on i386 instead?

  That would be an option, likely a very good one, but it would
probably break things badly because the same shared object needs
to call gmp/mpfr/X11/etc functions.

>>  Another issue is things like:
>> foo(0,0,0,0);
>> =>
>>    prepare 4
>>    xorr_i %r0 %r0
>>    pushr_i %r0
>>    xorr_i %r0 %r0
>>    pushr_i %r0
>>    xorr_i %r0 %r0
>>    pushr_i %r0
>>    xorr_i %r0 %r0
>>    pushr_i %r0
>>    finish @foo
>>    addi_l %sp %sp 16
>>
>> but this requires significant extra information to understand
>> what is going on.
>
> You can do constant propagation at a higher level for this, but
> however again I don't think there's anything to worry about regarding
> performance..

  Ok. I agree that this one is too pedantic, and actually it
is a lot easier to do in the calling code.

>>  Another somewhat unrelated comments is about the
>> initial trigonometric functions using x87. Probably they
>> are not (easily?) available on other cpus
>
> Not at all, actually.  And the inter-unit moves probably make them not
> so much faster compared to libc's sin, which you have to use anyway on
> non-x86.

  My interest in adding those was because they were just #if 0'ed, and
having a single opcode, possibly with some small tests, is tempting, as
it also means registers are saved. But the cost is very high, averaging
well over 100 cycles for sin/cos and some others. Add the transfer through
the stack and it becomes more costly. I wonder if there will ever be reliable
trigonometric or transcendental functions on x86+sse, as sse, I believe (and
actually the name implies :-), is not really meant for scientific programming,
but for multimedia.

> I'm sorry that I disagree on (almost) everything you wrote in this
> mail, it's usually not the case. :)

  No problem, and many thanks for reading and replying. Actually,
when I read my message back, I myself think I wrote mostly gibberish :-)

> Paolo

Thanks,
Paulo


