guile-devel

Re: Register VM WIP


From: Noah Lavine
Subject: Re: Register VM WIP
Date: Wed, 16 May 2012 10:54:37 -0400

Hi Mark,

You are thinking along very similar lines to how I used to think, but
I have a different way of looking at it that might make the design
seem better.

In our current VM, we have two stacks: the local-variable stack, which
has frames for different function calls and is generally what you'd
think of as a stack, and the temporary-variable stack, which is
literally a stack in the sense that you only operate on the top of it.
The temporary-variable stack makes us do a lot of unnecessary work,
because we have to load things from the local-variable stack to the
temporary-variable stack.

I think what Andy is proposing to do is to get rid of the
temporary-variable stack and operate directly on the local-variable
stack. We shouldn't think of these registers as being like machine
registers, and in fact maybe "registers" is not a good name for these
objects. They are really just variables in the topmost stack frame.
This should only reduce memory usage, because the local-variable stack
stays the same and the temporary-variable stack goes away (some
temporaries might move to the local-variable stack, but there can't be
more of them than there were on the temporary-variable stack, so
that's still a win).
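
To make the difference concrete, here is a minimal sketch in Python of
the two styles computing c = a + b. Everything in it is an assumption
made for illustration: the opcode names, the tuple encoding, and the
two toy interpreters are invented here and are not Guile's actual
instruction set or the wip-rtl design. It only shows why operating on
frame slots directly needs fewer dispatches.

```python
# Toy comparison: a stack-style VM versus a VM that addresses frame
# slots directly. Opcode names are hypothetical, not Guile's.

def run_stack_vm(code, frame):
    """Run stack-style code; return (frame, number of dispatches)."""
    temps = []        # the temporary-variable stack
    dispatches = 0
    for op, *args in code:
        dispatches += 1
        if op == "local-ref":      # copy a frame slot onto the temp stack
            temps.append(frame[args[0]])
        elif op == "local-set":    # pop the temp stack into a frame slot
            frame[args[0]] = temps.pop()
        elif op == "add":          # operands come only from the top of the stack
            b = temps.pop()
            a = temps.pop()
            temps.append(a + b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return frame, dispatches


def run_slot_vm(code, frame):
    """Run code whose operands name frame slots; return (frame, dispatches)."""
    dispatches = 0
    for op, *args in code:
        dispatches += 1
        if op == "add":            # read and write frame slots in one step
            dst, a, b = args
            frame[dst] = frame[a] + frame[b]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return frame, dispatches


# c = a + b, with a in slot 0, b in slot 1, and c in slot 2.
stack_code = [("local-ref", 0), ("local-ref", 1), ("add",), ("local-set", 2)]
slot_code = [("add", 2, 0, 1)]

stack_frame, stack_dispatches = run_stack_vm(stack_code, [3, 4, None])
slot_frame, slot_dispatches = run_slot_vm(slot_code, [3, 4, None])
```

Both runs leave the same values in the frame, but the stack-style
version dispatches four instructions where the slot-addressing version
dispatches one.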

The machine I was initially thinking of, and I imagine you were too,
is different. I had imagined a machine where the number of registers
was limited, ideally to the length of a processor cache line, and was
separate from the local-variables stack. In such a machine, the
registers are used as a cache for the local variables, and you get to
deal with all the register allocation problems that a standard
compiler would. That would accomplish the goal of keeping more things
in cache.

The "registers as cache" idea may result in faster code than the
"directly addressing local variables" idea, but it's also more
complicated to implement. So it makes sense to me that we would try
directly addressing local variables first, and maybe later move to
using a fixed-size cache of registers. It also occurs to me that the
RTL intermediate language, which is really just a language for
directly addressing an arbitrary number of local variables, is a
standard compiler intermediate language. So it might be useful to have
that around anyway, because we could more easily feed its output into,
for instance, GCC.
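
As a rough illustration of what "a language for directly addressing an
arbitrary number of local variables" looks like, the sketch below
flattens a nested expression into three-address instructions, creating
a fresh temporary for every intermediate result. The tuple format, the
operator names, and the t0, t1, ... naming are assumptions chosen for
this example; they are not the wip-rtl format or GCC's RTL.

```python
import itertools

def flatten(expr, code, fresh):
    """Append instructions for expr to code; return the name holding its value."""
    if isinstance(expr, str):       # a plain variable is already addressable
        return expr
    op, left, right = expr
    a = flatten(left, code, fresh)
    b = flatten(right, code, fresh)
    dst = f"t{next(fresh)}"         # unlimited supply of temporaries
    code.append((op, dst, a, b))    # (operator, destination, source, source)
    return dst

def to_three_address(expr):
    """Return (instruction list, name of the variable holding the result)."""
    code = []
    result = flatten(expr, code, itertools.count())
    return code, result

# (a + b) * (c - d)
code, result = to_three_address(("mul", ("add", "a", "b"), ("sub", "c", "d")))
for instruction in code:
    print(instruction)
```

Because nothing here limits the number of temporaries, no register
allocation is needed at this stage; that problem only appears if the
variables are later mapped onto a fixed-size set of registers.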

Andy, is this an accurate description of the register VM? And Mark and
everyone else, does it seem better when you look at it this way?

Noah

On Wed, May 16, 2012 at 9:44 AM, Mark H Weaver <address@hidden> wrote:
> Hi Andy!
>
> Andy Wingo <address@hidden> writes:
>> On Wed 16 May 2012 06:23, Mark H Weaver <address@hidden> writes:
>>
>>> It's surprising to me for another reason: in order to make the
>>> instructions reasonably compact, only a limited number of bits are
>>> available in each instruction to specify which registers to use.
>>
>> It turns out that being reasonably compact isn't terribly important --
>> more important is the number of opcodes it takes to get something done,
>> which translates to the number of dispatches.  Have you seen the "direct
>> threading" VM implementation strategy?  In that case the opcode is not
>> an index into a jump table, it's a word that encodes the pointer
>> directly.  So it's a word wide, just for the opcode.  That's what
>> JavaScriptCore does, for example.  The opcode is a word wide, and each
>> operand is a word as well.
>>
>> The design of the wip-rtl VM is to allow 16M registers (24-bit
>> addressing).  However, many instructions can just address 2**8 registers
>> (8-bit addressing) or 2**12 registers (12-bit addressing).  We will
>> reserve registers 253 to 255 as temporaries.  If you have so many
>> registers as to need more than that, then you have to shuffle operands
>> down into the temporaries.  That's the plan, anyway.
>
> I'm very concerned about this design, for the same reason that I was
> concerned about NaN-boxing on 32-bit platforms.  Efficient use of memory
> is extremely important on modern architectures, because of the vast (and
> increasing) disparity between cache speed and RAM speed.  If you can fit
> the active set into the cache, that often makes a profound difference in
> the speed of a program.
>
> I agree that with VMs, minimizing the number of dispatches is crucial,
> but beyond a certain point, having more registers is not going to save
> you any dispatches, because they will almost never be used anyway.
> 2^12 registers is _far_ beyond that point.
>
> As I wrote before concerning NaN-boxing, I suspect that the reason these
> memory-bloated designs are so successful in the JavaScript world is that
> they are specifically optimized for use within a modern web browser,
> which is already a memory hog anyway.  Therefore, if the language
> implementation wastes yet more memory it will hardly be noticed.
>
> If I were designing this VM, I'd work hard to allow as many loops as
> possible to run completely in the cache.  That means that three things
> have to fit into the cache together: the VM itself, the user loop code,
> and the user data.  IMO, the sum of these three things should be made as
> small as possible.
>
> I certainly agree that we should have a generous number of registers,
> but I suspect that the sweet spot for a VM is 256, because it enables
> more compact dispatching code in the VM, and yet is more than enough to
> allow a decent register allocator to generate good code.
>
> That's my educated guess anyway.  Feel free to prove me wrong :)
>
>    Regards,
>      Mark
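
The operand widths discussed in the quoted messages can be checked
with a little arithmetic. The sketch below packs an opcode and its
register operands into one 32-bit word in two ways: three 8-bit
operands, or a single 24-bit operand. The field layout (opcode in the
low 8 bits) and the function names are assumptions made for this
example, not the wip-rtl encoding; only the 8-, 12-, and 24-bit widths
come from the thread.

```python
# Register counts reachable with each operand width mentioned in the thread.
REGISTERS_8_BIT = 1 << 8      # 256
REGISTERS_12_BIT = 1 << 12    # 4096
REGISTERS_24_BIT = 1 << 24    # 16777216, the "16M" figure

def encode_abc(opcode, dst, a, b):
    """Pack an 8-bit opcode and three 8-bit register indices into 32 bits."""
    for field in (opcode, dst, a, b):
        if not 0 <= field <= 0xFF:
            raise ValueError("field does not fit in 8 bits")
    return opcode | (dst << 8) | (a << 16) | (b << 24)

def decode_abc(word):
    """Inverse of encode_abc: return (opcode, dst, a, b)."""
    return (word & 0xFF, (word >> 8) & 0xFF,
            (word >> 16) & 0xFF, (word >> 24) & 0xFF)

def encode_wide(opcode, reg):
    """Pack an 8-bit opcode and one 24-bit register index into 32 bits."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode does not fit in 8 bits")
    if not 0 <= reg < REGISTERS_24_BIT:
        raise ValueError("register does not fit in 24 bits")
    return opcode | (reg << 8)

def decode_wide(word):
    """Inverse of encode_wide: return (opcode, reg)."""
    return word & 0xFF, word >> 8

compact = encode_abc(1, 2, 0, 1)       # fits: all registers below 256
wide = encode_wide(7, 100_000)         # needs the 24-bit form
```

In this toy layout, an instruction that names register 300 cannot use
the three-operand form, which is the situation where operands would
have to be shuffled down into low-numbered temporaries first.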


