guile-devel

## Re: thoughts on native code

From: Stefan Israelsson Tampe
Subject: Re: thoughts on native code
Date: Sun, 11 Nov 2012 20:28:04 +0100

Hi, I hope I'm not boring you!

Ok, the next effort will be to investigate the issue of CPU register allocation. One thing to note is that some instructions take vastly more resources than others, and we need to track this. I'm leaning towards implementing a version of the linear scan algorithm. This is how I think about it.

The first issue is that I like to relate everything to costs. Basically, if we manage to keep a local variable in a register, we gain something, say g. On the other hand, during the interval between setting and retrieving the local variable we have performed a set of operations, say w. If we find that g / w is large, we gain so much during the interval that we should put the variable in a register; otherwise we do not bother. Now over to branching: for a branch we split the liveness interval into segments between usages, and for a segment across a branch we would have

[g1, w1] -> { [g2,w2] , [g3,w3] }

Now generally we would like to assign some kind of probability to the different branches and do something like

F = p  * (g1 + g2) / (w1 + w2) + (1-p) * (g1 + g3) / (w1 + w3)
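As a quick illustration, this branch-blended score can be computed directly; a minimal Python sketch (the function name and argument layout are mine, nothing from Guile):

```python
def blended_gain(g1, w1, g2, w2, g3, w3, p=0.5):
    """Branch-blended gain/work ratio for a segment [g1, w1] that
    splits into successor segments [g2, w2] and [g3, w3]; p is the
    probability of taking the first branch (1/2 when unknown)."""
    return p * (g1 + g2) / (w1 + w2) + (1 - p) * (g1 + g3) / (w1 + w3)
```

With no branch information, p simply defaults to 1/2.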

If we do not have any information we just put p = 1/2. The astute reader will realize that this means a potential geometric explosion in the number of terms in the sum. Let's try

F = g1 / w1 + p * (g2 / w2) + (1 - p) * (g3 / w3)

Nah, that's not good either: if g1 is zero we would not amortize over w1 at all. This leads to

g1' = g1 + w1 * L * ( p * g2/w2 + (1-p) * g3/w3 )
F  = g1'/w1

The linear update with p = 1 is then

p = 1 => g1' = g1 + w1 * L * (g2/w2)

and
F = g1/w1 + L * (g2/w2)

How to select L? Consider a single successor segment [g2, w2] for which we want this expression to agree with the more theoretically compelling form, e.g.

F = g1/w1 + g2/(w1 + w2) = g1/w1 + L * (g2/w2)

We would then conclude that L = w2 / (w1 + w2), and

F = g1/w1 + w2 / (w1 + w2) * (g2 / w2)
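A small sketch of how such a per-segment score could be computed, with L = w_i / (w1 + w_i) applied per successor (illustrative Python; the segment representation is my own invention, not anything in Guile):

```python
def segment_score(g1, w1, children=()):
    """F for a liveness segment: the local gain g1 amortized over the
    local work w1, plus each successor's gain/work ratio discounted by
    L = w_i / (w1 + w_i) and weighted by its branch probability p_i.
    children is an iterable of (p_i, g_i, w_i) triples."""
    f = g1 / w1
    for p, g, w in children:
        f += p * (w / (w1 + w)) * (g / w)
    return f
```

For a single successor taken with certainty (p = 1) this reduces to g1/w1 + g2/(w1 + w2); for a tree-shaped code graph the successors' own scores can be built the same way in one bottom-up pass.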

I'm not certain this is the best choice, and maybe you see a better solution, but the nice thing is that we can build up the solution recursively if we follow this scheme; for a code graph that is a tree we can scan the tree only once and build up these numbers. This method can be refined.

Anyhow, we can take all usages, sum up an F at the start, and move forward segment by segment over the whole liveness of the local variable. Now do all this for all variables. Then employ a version of the linear scan algorithm, but only select a variable for register allocation if its F is above a threshold C. Which threshold? Well, you could try varying C: let Q(C) = fraction of allocated registers per unit of time, then select C as MIN(C s.t. Q(C) > 0.9).

We could refine this method by varying C over the graph somewhat.

What do you think?

Happy Mathing!!
/Stefan

On Sat, Nov 10, 2012 at 11:06 PM, Stefan Israelsson Tampe wrote:
I would like to continue the discussion about native code.

Some facts first. For example, consider this:

(define (f x)
  (let loop ((s 0) (i 0))
    (if (eq? i x) s (loop (+ s i) (+ i 1)))))

The timings for (f 100000000) ~ (f 100M) are:

1) current vm            : 2.93s
2) rtl                   : 1.67s
3) compressed native     : 1.15s
4) uncompressed native   : 0.54s

sbcl = compressed native + better register allocation (normal optimization level): 0.68s

Note that for this example the call overhead is close to 5 ns per iteration, meaning that if we combined 4) with better register handling, the potential is to get this loop to run in 0.2 s; that is, the loop could run 500M iterations in one second without sacrificing safety and without an extraterrestrial code analyzer. Also note that the compressed native code is smaller than the rtl code by some factor, and if we could make better use of registers we would end up with even less overhead.
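The arithmetic behind that claim, spelled out (pure back-of-the-envelope, using the 0.54 s figure from the table above):

```python
iters = 100_000_000                     # (f 100M)
uncompressed_native_s = 0.54            # timing 4) above
ns_per_iter = uncompressed_native_s / iters * 1e9   # ~5.4 ns, of which ~5 ns is call overhead

# if better register handling cut the per-iteration cost to ~2 ns:
target_s = 2e-9 * iters                 # ~0.2 s for 100M iterations
iters_per_sec = iters / target_s        # ~500M iterations per second
```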

Note that compressed native is a very simple mechanism to gain some speed and also improve memory usage in the instruction flow. The assembler is also very simplistic, and it would not be too much hassle to port a new instruction format to that environment. It is probably also possible to handle the complexity of the code in pure C for the stubs and, by compiling them in a special way, make sure they output a format that can be combined with the meta information in the special registers needed to make execution of the compiled Scheme effective.

This study also shows that there is a clear benefit in being able to use the machine's registers, and I think this is how you would like the system to behave in the end. sbcl does this rather nicely, and we could look at their way of doing it.

So, the main question to you now is how to implement the register allocation. Basic principles of register allocation can be found on the internet, I'm sure, but the problem is how to handle the interaction with the helper stubs. That is something I'm not sure of yet.

A simple solution would be to assume that the native code has a set of available registers r1, ..., ri and then force the compilation of the stubs to treat them just like the registers bp, sp, and bx. I'm sure this is possible to configure in gcc.

So the task for me right now is to find out more about how to do this; if you have any pointers or ideas, please help out.

Cheers
Stefan

On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe wrote:
Hi all,

After talking with Mark Weaver about his view on native code, I have been pondering how to best model our needs.

I now have a framework that translates almost all of the rtl vm directly to native code, and it does show a speed increase of about 4x compared to running the rtl VM. I can also generate rtl code all the way from guile scheme right now, so it's pretty easy to generate test cases. The problem Mark points out is that we need to take care not to blow the instruction cache. This is not seen in these simple examples, but we need larger code bases to test what is actually true. What we can note, though, is that I expect the size of the code to blow up by a factor of around 10 compared to the instruction feed in the rtl code.

One interesting fact is that SBCL does fairly well by basically using native instructions as the instruction flow of its VM. For example, if it can deduce that a + operation works with fixnums, it simply compiles that as a function call to a general + routine: it does a long jump to the + routine, does the plus, and long-jumps back, essentially dispatching general instructions like + * / etc. directly. That is, sbcl does have a virtual machine; it just doesn't do a table lookup for the dispatch, but uses function calls instead. If you count long jumps, this means the number of jumps for these instructions is double that of the table lookup method. But for calling functions and returning from them the number of long jumps is the same, and moving local variables into place and jumping is really fast.
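A toy sketch of the difference between the two dispatch styles (illustrative Python only, nothing to do with SBCL's actual machinery):

```python
def op_add(stack):
    """A shared '+' routine, analogous to the general + routine
    that sbcl long-jumps to."""
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

TABLE = {0: op_add}

def run_table(code, stack):
    """Classic VM dispatch: fetch an opcode, look it up in a table, call."""
    for opcode in code:
        TABLE[opcode](stack)

def run_calls(code, stack):
    """sbcl-style dispatch: the instruction flow already consists of the
    routines themselves, so each step is a direct call, no table lookup."""
    for op in code:
        op(stack)
```

Both do the same work; the second merely removes the lookup from the hot path, which is essentially the trade described above.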

Anyway, this method of dispatching would mean a fairly small footprint compared to direct assembler. Another big chunk of code that we can speed up without too much bloat in the instruction cache is the lookup of pairs, structs and arrays; the reason is that in many cases we can deduce so much at compilation time that we do not need to check the type every time, but can safely look up the needed information.

Now, is this method fast? Well, looking at the sbcl code for calculating 1 + 2 + 3 + 4 (disassembling it), I see that it does use the mechanism above, and it manages to sum 150M terms in one second; that's quite a feat for a VM with no JIT. The same with the rtl VM is 65M.

Now, sbcl's compiler is quite mature and uses native registers quite well, which explains part of the speed. My point, though, is that we can efficiently model a VM with calls and native instructions as the instruction flow.

Regards Stefan