
Re: [Lightning] About using lightning on a dynamically typed language


From: Paolo Bonzini
Subject: Re: [Lightning] About using lightning on a dynamically typed language
Date: Sun, 16 May 2010 11:31:58 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.0.4-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.4

On 05/16/2010 11:09 AM, Paulo César Pereira de Andrade wrote:
> Paolo Bonzini wrote:
>>>    The language VM is currently implemented using computed gotos, and
>>> I plan to add several "super instructions" to it to attempt to reduce
>>> the cost of indirect jumps.
>>
>> You can expect a 15-40% performance improvement from that, depending on
>> the architecture (unfortunately the 40% was on the Pentium 4...).

>    I think it is dependent on the CPU, but a simple "noop" example:
> -%<-
> void test() {
>     auto a, b;
>     for (a = b = 0; a < 10000000; ++a)
>         b += a;
>     print("%d\n", b);
> }
> test();
> -%<-
> that becomes pseudo bytecode:
> -%<-
> test:
>    enter 3   ;; # of stack slots (including unnamed push operand of add)
>    int 0
>    sd a
>    sd b
> L1:
>    ld b
>    push
>    ld a
>    add
>    sd b
>    ld a
>    inc
>    sd a
>    push
>    int 10000000
>    lt
>    jt L1
>    ld b
>    push
>    literal "%d\n"
>    builtin print 2
>    ret
> main:
>    call test
>    exit
> -%<-

> just by adding the extra opcodes:
>    ld+push <arg>
>    sd+push <arg>
>
> it already gave an almost 40% speedup for the gcc -O3 compiled VM,

Ah, that's because it's not purely a stack machine. A more common design for a stack VM would have opcodes like "push <arg>" (one for each kind of <arg>), "store <arg>" (one for each kind of <arg>), and "pop and store <arg>".
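
For concreteness, here is a minimal sketch of what such a computed-goto dispatcher with one fused superinstruction can look like, using GCC's "labels as values" extension. The opcode names, encoding, and accumulator-plus-stack layout are invented for illustration; they are not Paulo's actual bytecode:
-%<-
/* Minimal sketch of computed-goto dispatch (GCC extension) for an
   accumulator-plus-stack VM, with one fused ld+push superinstruction.
   Opcode names and encoding are hypothetical. */
#include <stdio.h>

enum { OP_LD, OP_PUSH, OP_LD_PUSH, OP_ADD, OP_HALT };

static long run(const unsigned char *pc, long *locals)
{
    static const void *dispatch[] = {
        &&op_ld, &&op_push, &&op_ld_push, &&op_add, &&op_halt
    };
    long stack[16], *sp = stack, acc = 0;

#define NEXT() goto *dispatch[*pc++]

    NEXT();
op_ld:      acc = locals[*pc++];          NEXT();  /* two indirect jumps */
op_push:    *sp++ = acc;                  NEXT();  /* ... per ld/push pair */
op_ld_push: *sp++ = acc = locals[*pc++];  NEXT();  /* one jump when fused */
op_add:     acc += *--sp;                 NEXT();
op_halt:    return acc;
}

int main(void)
{
    /* b + a with locals[0] = a = 2, locals[1] = b = 40 */
    const unsigned char code[] = { OP_LD_PUSH, 1, OP_LD, 0, OP_ADD, OP_HALT };
    long locals[] = { 2, 40 };
    printf("%ld\n", run(code, locals));   /* prints 42 */
    return 0;
}
-%<-
The fused handler does the work of ld followed by push but ends in a single indirect jump instead of two, which is exactly the dispatch cost that superinstructions are meant to remove.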

My 15-40% figure was based on my experience with GNU Smalltalk (around 2003). There I switched from that design to a small set of opcodes (still around 30) in which the complex opcodes started at "pop and store <arg>", and I added 192 complex opcodes based on static analysis of a big body of code. Some of them were simply "pop and store <arg>"; others were more complex, like "pop/dup/push 1/add", which occurred in for loops.
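
The selection step can be as simple as counting adjacent opcode pairs over a corpus of compiled code and fusing the most frequent ones. A rough sketch of that idea follows; the operand-free 1-byte encoding is an assumption made to keep the scan trivial, and the names are hypothetical:
-%<-
/* Rough sketch of the static-analysis step: count adjacent opcode pairs
   (bigrams) over a corpus of bytecode, then report the hottest pairs as
   superinstruction candidates.  Assumes 1-byte opcodes with no inline
   operands; a real scan has to step over operand bytes. */
#include <stdio.h>

#define NOPS 256

static unsigned long bigram[NOPS][NOPS];

static void scan(const unsigned char *code, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++)
        bigram[code[i]][code[i + 1]]++;
}

static void report(unsigned long threshold)
{
    for (int a = 0; a < NOPS; a++)
        for (int b = 0; b < NOPS; b++)
            if (bigram[a][b] >= threshold)
                printf("fuse candidate: %02x %02x (%lu occurrences)\n",
                       a, b, bigram[a][b]);
}

int main(void)
{
    /* toy corpus: the pair 01 02 dominates */
    const unsigned char corpus[] = { 1, 2, 3, 1, 2, 3, 1, 2 };
    scan(corpus, sizeof corpus);
    report(2);
    return 0;
}
-%<-
Extending the count from pairs to longer windows is what yields sequences such as the pop/dup/push 1/add pattern mentioned above.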

Part of the advantage came from the new bytecode set being made entirely of 2-byte opcodes, which made bytecode fetch/decode much faster too, and almost pipelinable.
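
To illustrate why a fixed width helps, here is a sketch in the same style as above, again with invented opcodes: every instruction is an (opcode, operand) byte pair, padded when the operand is unused, so computing the next pc never waits on decoding the current opcode:
-%<-
/* Sketch of a fixed-width, 2-byte encoding.  Since the instruction size
   never varies, the next fetch does not depend on decoding the current
   opcode, so fetch, decode, and dispatch can overlap. */
#include <stdio.h>

enum { OP_PUSH_IMM, OP_ADD, OP_PRINT, OP_HALT };

static void run2(const unsigned char *pc)
{
    static const void *dispatch[] = { &&push_imm, &&add, &&print, &&halt };
    long stack[16], *sp = stack;
    unsigned char op, arg;

    /* the whole fetch/decode: two fixed-offset loads, constant step */
#define NEXT2() do { op = pc[0]; arg = pc[1]; pc += 2; goto *dispatch[op]; } while (0)

    NEXT2();
push_imm: *sp++ = arg;              NEXT2();
add:      sp[-2] += sp[-1]; sp--;   NEXT2();
print:    printf("%ld\n", sp[-1]);  NEXT2();
halt:     return;
}

int main(void)
{
    const unsigned char code[] = {      /* operands of add/print/halt */
        OP_PUSH_IMM, 1, OP_PUSH_IMM, 2, /* are just padding bytes */
        OP_ADD, 0, OP_PRINT, 0, OP_HALT, 0
    };
    run2(code);                         /* prints 3 */
    return 0;
}
-%<-
The padding byte wastes a little space, but it buys a constant-stride fetch, which is the property that makes decode almost pipelinable.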

> -O0 gives something like a sub-10% speedup, but I am only testing on i686 and ia64.

-O0 performance doesn't really count, no?

Paolo


