[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: guile 3 update, june 2018 edition
Re: guile 3 update, june 2018 edition
Thu, 05 Jul 2018 19:05:02 +0200
Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)
On Mon 02 Jul 2018 11:28, address@hidden (Ludovic Courtès) writes:
> Andy Wingo <address@hidden> skribis:
>> My current plan is that the frame overhead will still be two slots: the
>> saved previous FP, and the saved return address. Right now the return
>> address is always a bytecode address. In the future it will be bytecode
>> or native code. Guile will keep a runtime routine marking regions of
>> native code so it can know if it needs to if an RA is bytecode or native
>> code, for debugging reasons; but in most operation, Guile won't need to
>> know. The interpreter will tier up to JIT code through an adapter frame
>> that will do impedance matching over virtual<->physical addresses. To
>> tier down to the interpreter (e.g. when JIT code calls interpreted
>> code), the JIT will simply return to the interpreter, which will pick up
>> state from the virtual IP, SP, and FP saved in the VM state.
> What will the “adapter frame” look like?
Aah, sadly it won't work like this. Somehow I was thinking of an
adapter frame on the C stack. However an adapter frame corresponds to a
continuation, so it would have to have the life of a continuation, so it
would have to be on the VM stack. I don't think I want adapter frames
on the VM stack, so I have to scrap this. More below...
>> We do walk the stack from Scheme sometimes, notably when making a
>> backtrace. So, we'll make the runtime translate the JIT return
>> addresses to virtual return addresses in the frame API. To Scheme, it
>> will be as if all things were interpreted.
> Currently you can inspect the locals of a stack frame. Will that be
> possible with frames corresponding to native code? (I suppose that’d be
Yes, because native code manipulates the VM stack in exactly the same
way as bytecode. Eventually we should do register allocation and avoid
always writing values to the stack, but that is down the road.
>> My current problem is knowing when a callee has JIT code. Say you're in
>> JITted function F which calls G. Can you directly jump to G's native
>> code, or is G not compiled yet and you need to use the interpreter? I
>> haven't solved this yet. "Known calls" that use call-label and similar
>> can of course eagerly ensure their callees are JIT-compiled, at
>> compilation time. Unknown calls are the problem. I don't know whether
>> to consider reserving another word in scm_tc7_program objects for JIT
>> code. I have avoided JIT overhead elsewhere and would like to do so
>> here as well!
> In the absence of a native code pointer in scm_tc7_program objects, how
> will libguile find the native code for a given program?
This is a good question and it was not clear to me when I wrote this! I
think I have a solution now but it involves memory overhead. Oh well.
Firstly, I propose to add a slot to stack frames. Stack frames will now
store the saved FP, the virtual return address (vRA), and the machine
return address IP (mRA). When in JIT code, a return will check if the
mRA is nonzero, and if so jump to that mRA. Otherwise it will return
from JIT, and the interpreter should continue.
Likewise when doing a function return from the interpreter and the mRA
is nonzero, the interpreter should return by entering JIT code to that
When building an interpreter-only Guile (Guile without JIT) or an
AOT-only Guile (doesn't exist currently), we could configure Guile to
not reserve this extra stack word. However that would be a different
ABI: a .go file built with interpreter-only Guile wouldn't work on
Guile-with-JIT, because interpreter-only Guile would think stack frames
only need two reserved words, whereas Guile-with-JIT would write three
words. To avoid the complication, for 3.0 I think we should just use
3-word frames all the time.
So, that's returns. Other kinds of non-local returns like
abort-to-prompt, resuming delimited continuations, or calling
undelimited continuations would work similarly: the continuation would
additionally record an mRA, and resuming would jump there instead, if
Now, calls. One of the reasons that I wanted to avoid an extra program
word was because scm_tc7_program doesn't exist in a one-to-one
relationship with code. "Well-known" procedures get compiled by closure
optimization to be always called via call-label or tail-call-label -- so
some code doesn't have program objects. On the other hand, closures
mean that some code has many program objects.
So I thought about using side tables indexed by code; or inline
"maybe-tier-up-here" instructions, which would reference a code pointer
location, that if nonzero, would be the JIT code.
However I see now that really we need to optimize for the JIT-to-JIT
call case, as by definition that's going to be the hot case. Of course
call-label from JIT can do an unconditional jmp. But calling a program
object... how do we do this? This is complicated by code pages being
read-only, so we don't have space to store a pointer in with the code.
I think the answer is, like you say, another word in program objects.
JIT code will then do, when it sees a call:
if (SCM_PROGRAM_P (x) && SCM_PROGRAM_MCODE (x))
goto *SCM_PROGRAM_MCODE (x);
vp->ip = SCM_PROGRAM_CODE (x);
return; // Handle call in interpreter.
which is reasonably direct and terse. I can't think of any other option
that would be reasonably fast.
Now, we're left with a couple of issues: what to do about call-label in
interpreted code? Also: when should we JIT?
Let's take the second question first. I think we should JIT based on
counters, i.e. each function has an associated counter, and when that
counter overflows (or underflows, depending on how we configure it),
then we JIT that function. This has the advantage of being
deterministic. Another option would be statprof-like profiling, which
is cool because it measures real run time, but beside not being terrible
portable, it isn't deterministic, so it makes bugs hard to reproduce.
Another option would be a side hash table keyed by code address. This
is what LuaJIT does but you get collisions and since addresses are
randomized, that also makes JIT unpredictable.
Usually your counter limit is set depending on how expensive it is to
JIT. If JIT were free the counter limit would be 0. Guile's JIT will
be cheap, though not free. Additionally some JITs set higher counter
values because they do some form of adaptive optimization and they want
its first tier at 500, incrementing by 15 for each call, 10 for each
return, and 1 for each loop iteration. But LuaJIT sets the threshold at
56, incrementing by 1 each call and 2 each loop. Weird, right? Anyway
that's how it is. We'll have to measure.
Note that there are two ways you might want to tier up to JIT from the
interpreter -- at function entry, and in loops. Given that JIT code and
interpreter code manipulates the stack in the same way, you *can* tier
up anywhere -- but those are the places that it's worth checking.
Function entry is the essential place. That's when you have the
scm_tc7_program in slot 0, if that's how the procedure is compiled, and
that's when you can patch the JIT code slot in the program object.
The counter itself should be statically allocated in the image rather
than existing as part of the scm_tc7_program object, because multiple
closures can share code. So, that to me says:
* we make a pass in the compiler in a late stage to determine which
function entry labels will be ever put in a scm_tc7_program
* we add "enter-function" opcodes to the beginning of those functions'
code that will increment a statically allocated counter, maybe tier
up and patch the program object.
* the next call to those functions will see the JIT code in the
program and jump in directly
For nests of functions that call each other via call-label, I was
thinking that these functions should be compiled all at once. But in
the case of a big nested function where many things are visible -- the
style of programming used in the fastest program -- then there we could
cause too much latency, by biting off too much at once. So perhaps
there we should use enter-function as well, just with a flag that this
function has no closure so there's nothing to patch. If, when
compiling, we see a call-label, we can speculatively see if there's JIT
code for the callee already, and in that case emit a direct jump; but
otherwise we can instead emit an indirect jump. The JIT code pointer
would then be not in the scm_tc7_program, but statically allocated in
In fact probably we want statically allocated JIT code pointers in the
image anyway so that when one closure tiers up, the next closures tier
up "for free" on their next call just by checking that pointer.
Tier-up sites inside functions are ephemeral: they are typically only
entered once. The next time the function is entered it's via JIT,
usually. So, no need to cache them, we can just tier up the function
(ensuring its corresponding statically allocated JIT pointer is set),
then jump to the offset in the JIT code corresponding to that bytecode
Probably we can make "handle-interrupts" do this tier-up.
OK this mail is long enough!!! Does it make things clearer though?
* adapt compiler and runtime for 3-word stack frames
* adapt compiler and runtime for extra word in scm_tc7_program
* allocate counter and code pointer for each function
* adapt compiler and runtime to insert opcode at program entry (maybe
it can run the apply hook!), with flag indicating whether argv
might need patching
* re-take JIT implementation
ps. I am thinking of calling bytecode "vcode" in the future, for
virtual code, and JIT code "mcode", for machine code. WDYT? :)