lightning
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Lightning] About using lightning on a dynamically typed language


From: Paulo César Pereira de Andrade
Subject: Re: [Lightning] About using lightning on a dynamically typed language
Date: Sun, 16 May 2010 21:29:44 -0300
User-agent: SquirrelMail/1.4.19

Paolo Bonzini wrote:
> On 05/16/2010 11:09 AM, Paulo César Pereira de Andrade wrote:
>> Paolo Bonzini wrote:
>>>>   The language vm is currently implemented using computed gotos, and
>>>> I plan to add several "super instructions" to it to attempt to reduce
>>>> the cost of indirect jumps.
>>>
>>> You can expect 15-40% performance improvement from that, depending on
>>> the architecture (unfortunately 40% was on the Pentium 4...).
>>
>>    I think it is dependent on the cpu, but a simple "noop" example:
>> -%<-
>> void test() {
>>      auto a, b;
>>      for (a = b = 0; a<  10000000; ++a)
>>          b += a;
>>      print("%d\n", b);
>> }
>> test();
>> -%<-
>> that becomes pseudo bytecode:
>> -%<-
>> test:
>>    enter 3   ;; # of stack slots (including unnamed push operand of add)
>>    int 0
>>    sd a
>>    sd b
>> L1:
>>    ld b
>>    push
>>    ld a
>>    add
>>    sd b
>>    ld a
>>    inc
>>    sd a
>>    push
>>    int 10000000
>>    lt
>>    jt L1
>>    ld b
>>    push
>>    literal "%d\n"
>>    builtin print 2
>>    ret
>> main:
>>    call test
>>    exit
>> -%<-
>>
>> just by adding the extra opcodes:
>>    ld+push<arg>
>>    sd+push<arg>
>>
>> it already gave almost 40% speedup for gcc -O3 compiled vm,
>
> Ah, that's because it's not purely a stack machine.  A more common
> design for a stack VM would have opcodes like "push <arg>" (one for each
> kind of <arg>"), "store <arg>" (one for each kind of <arg>"), "pop and
> store <arg>".

  This was intended to have a small instruction set, and still it
is a good pattern for complex operations where the load/stores can
also be a vector element or record field. But not much good for
loops, as in loops it is better to not implicitly pop the stack,
but have extra temporaries. But yes, the concept you describe would
be already having a good share of the "super instructions" to start :-)

> My 15%-40% figure was based on my experience in GNU Smalltalk (around
> 2003).  There I switched from that design to a small set of opcodes (but
> still around ~30) where complex opcodes started at "pop and store
> <arg>", and I added 192 complex opcodes based on static analysis of a
> big body of code.  Some of them of course were simply "pop and store
> <arg>", others were more complex like "pop/dup/push 1/add" which
> occurred in for loops.

  I developed most of the code in an Amd Sempron, and noticed that
the equivalent -O0 C code, eg, the above example with s/auto/long long/
and calling test() from main(), the vm would run 6-8 times slower,
but on a generic Athlon 64 the -O0 code would run 20 times faster.

> Part of the advantage was because the new bytecode set was entirely made
> of 2-byte opcodes, thus making fetch/decode of bytecodes much faster too
> and almost pipelinable.
>
>> -O0 gives like sub 10%, but I am only testing on i686 and ia64.
>
> -O0 performance doesn't really count, no?

  Yes, I said it mainly to have an idea of the reduced cost of the
super/composed opcode in computing time, when it removes an indirect
jump and allowing extra optimizations in the longer code block.



  Sorry for showing my ignorance, but what would be the best way to
see the generated jit as assembly output? I think I would need that
to better map "in my little mind" what code is being generated...

  I made a small test, with a very simple program, but I just get
core dumps. I am surely doing something dumb (probably wrong
dereference or register clobber), but if you can give some insight
I would appreciate.

-%<-
#include <stdio.h>
#include <stdlib.h>
#include <lightning.h>

#define NOOP    0
#define LOAD    1
#define STORE   2
#define PUSH    3
#define ADD     4
#define INC     5
#define LT      6
#define J       7
#define JT      8
#define IMM     9
#define END     10
int program[] = {
    IMM,        0,              STORE,          0,
    STORE,      1,/*L:*/        LOAD,           1,
    PUSH,       LOAD,           0,              ADD,
    STORE,      1,              LOAD,           0,
    INC,        STORE,          0,              PUSH,
    IMM,        0x989680,       LT,             JT,
    -17,        END
};

typedef struct thread {
    int *ss, *bp, *sp;
} thread_t;

typedef int (*pift_t)(thread_t *);

pift_t compile(int *program, int length) {
    int          *base;
    jit_insn     *code;
    jit_insn    **labels;
    int           thread;

    /* random guess about size */
    code = malloc(length * 32 * sizeof(int));

    /* XXX no label information here... */
    labels = malloc(length * sizeof(jit_insn*));

    jit_set_ip(code);

    jit_prolog(1);
    thread = jit_arg_i();
    jit_getarg_i(JIT_V1, thread);       /* v1 = thread */

    for (base = program; length > 0; --length, ++program) {
        labels[program - base] = jit_get_label();
        switch (*program) {
            case NOOP:
                break;
            case LOAD:
                ++program;
                /* r1 = thread->ss */
                jit_ldxi_i(JIT_R1, JIT_V1, (int)&((thread_t*)0)->bp);
                /* r0 = r1[*program] */
                jit_ldxi_i(JIT_R0, JIT_R1, *program * sizeof(int));
                break;
            case STORE:
                ++program;
                /* r1 = thread->ss */
                jit_ldxi_i(JIT_R1, JIT_V1, (int)&((thread_t*)0)->bp);
                /* r1[*program] = r0 */
                jit_stxi_i(*program * sizeof(int), JIT_R1, JIT_R0);
                break;
            case PUSH:
                /* r1 = thread->sp */
                jit_ldxi_i(JIT_R1, JIT_V1, (int)&((thread_t*)0)->sp);
                /* r2 = r1 + sizeof(int) */
                jit_addi_i(JIT_R2, JIT_R1, sizeof(int));
                /* thread->sp = r2 */
                jit_stxi_i((int)&((thread_t*)0)->sp, JIT_V1, JIT_R2);
                /* *r1 = r0 */
                jit_stxi_i(0, JIT_R1, JIT_R0);
                break;
            case ADD:
                /* r1 = thread->sp */
                jit_ldxi_i(JIT_R1, JIT_V1, (int)&((thread_t*)0)->sp);
                /* r2 = r1 - sizeof(int) */
                jit_subi_i(JIT_R2, JIT_R1, sizeof(int));
                /* r1 = *r1 */
                jit_ldi_i(JIT_R1, 0);
                /* r0 = r1 + r0 */
                jit_addr_i(JIT_R0, JIT_R1, JIT_R0);
                /* thread->sp = r2 */
                jit_stxi_i((int)&((thread_t*)0)->sp, JIT_V1, JIT_R2);
                break;
            case INC:
                /* r0 = r0 + 1 */
                jit_addi_i(JIT_R0, JIT_R0, 1);
                break;
            case LT:
                /* r1 = thread->sp */
                jit_ldxi_i(JIT_R1, JIT_V1, (int)&((thread_t*)0)->sp);
                /* r2 = r1 - sizeof(int) */
                jit_subi_i(JIT_R2, JIT_R1, sizeof(int));
                /* r1 = *r1 */
                jit_ldi_i(JIT_R1, 0);
                /* r0 = r1 < r0 */
                jit_lti_i(JIT_R0, JIT_R1, JIT_R0);
                /* thread->sp = r2 */
                jit_stxi_i((int)&((thread_t*)0)->sp, JIT_V1, JIT_R2);
                break;
            case J:
                /* XXX no forward jumps in sample test */
                ++program;
                jit_jmpi(labels[program - base + *program]);
                break;
            case JT:
                /* XXX no forward jumps in sample test */
                ++program;
                jit_bnei_ui(labels[program - base + *program], JIT_R0, 0);
                break;
            case IMM:
                /* r0 = immediate */
                ++program;
                jit_movi_i(JIT_R0, *program);
                break;
            case END:
                jit_retval(JIT_R0);
                jit_ret();
                jit_flush_code(code, jit_get_ip().ptr);
                return (pift_t)(jit_get_ip().iptr);
        }
    }
    abort();
}

void vm(void) {
    pift_t      jit;
    thread_t    thread;

    thread.ss = malloc(16 * sizeof(int));
    thread.bp = thread.sp = thread.ss;
    jit = compile(program, sizeof(program) / sizeof(int));
    printf("%d\n", (*jit)(&thread));
}

int
main(int argc, char *argv[])
{
    vm();
    return 0;
}
-%<-
This is a variant of some tests I made, converted, for test purposes
to use int instead of long long (and that may be tricky as could need
to implement 64 bits arithmetic on 32 bits cpus and keep track of
lightning registers).

> Paolo

Paulo




reply via email to

[Prev in Thread] Current Thread [Next in Thread]