From: Paolo Bonzini
Subject: Re: [Lightning] About using lightning on a dynamically typed language
Date: Sun, 16 May 2010 11:31:58 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.0.4-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.4
On 05/16/2010 11:09 AM, Paulo César Pereira de Andrade wrote:
> Paolo Bonzini wrote:
>>> The language vm is currently implemented using computed gotos, and I
>>> plan to add several "super instructions" to it to attempt to reduce
>>> the cost of indirect jumps.
>> You can expect 15-40% performance improvement from that, depending on
>> the architecture (unfortunately 40% was on the Pentium 4...).
>
> I think it is dependent on the cpu, but a simple "noop" example:
>
> -%<-
> void test() {
>     auto a, b;
>
>     for (a = b = 0; a < 10000000; ++a)
>         b += a;
>     print("%d\n", b);
> }
> test();
> -%<-
>
> that becomes pseudo bytecode:
>
> -%<-
> test:
>         enter 3         ;; # of stack slots (including unnamed push operand of add)
>         int 0
>         sd a
>         sd b
> L1:
>         ld b
>         push
>         ld a
>         add
>         sd b
>         ld a
>         inc
>         sd a
>         push
>         int 10000000
>         lt
>         jt L1
>         ld b
>         push
>         literal "%d\n"
>         builtin print 2
>         ret
> main:
>         call test
>         exit
> -%<-
>
> just by adding the extra opcodes:
>
>         ld+push <arg>
>         sd+push <arg>
>
> it already gave almost 40% speedup for gcc -O3 compiled vm,
Ah, that's because it's not purely a stack machine. A more common design for a stack VM would have opcodes like "push <arg>" (one for each kind of <arg>), "store <arg>" (one for each kind of <arg>), and "pop and store <arg>".
My 15%-40% figure was based on my experience in GNU Smalltalk (around 2003). There I switched from that design to a small set of opcodes (still around 30) where the complex opcodes started at "pop and store <arg>", and I added 192 complex opcodes based on static analysis of a big body of code. Some of them were simply "pop and store <arg>"; others were more complex, like "pop/dup/push 1/add", which occurred in for loops.
Part of the advantage came from the new bytecode set being made entirely of 2-byte opcodes, which made fetch/decode of bytecodes much faster too, and almost pipelinable.
> -O0 gives like sub 10%, but I am only testing on i686 and ia64.
-O0 performance doesn't really count, no?

Paolo