guile 3 update, september edition

From: Andy Wingo
Subject: guile 3 update, september edition
Date: Mon, 17 Sep 2018 10:25:34 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)


This is an update on progress towards Guile 3.  In our last update, we
saw the first bits of generated code.

Since then, the JIT is now feature-complete.  It can JIT-compile *all*
code in Guile, including delimited continuations, dynamic-wind, all
that.  It runs automatically, in response to a function being called a
lot.  It can also tier up from within hot loops.
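As a toy illustration (hypothetical file and function names -- not from
the branch itself), a counted loop like the following runs far past the
default call/iteration threshold, so it should tier up to machine code
while it is still running:

```scheme
;; /tmp/hot.scm -- hypothetical example of a loop hot enough to
;; trigger the JIT mid-loop.
(define (sum-to n)
  (let loop ((i 0) (acc 0))
    (if (= i n)
        acc
        (loop (+ i 1) (+ acc i)))))

;; A million iterations is well past the default threshold of 50000.
(display (sum-to 1000000))
(newline)
```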

The threshold at which Guile will automatically JIT-compile is set from
the GUILE_JIT_THRESHOLD environment variable.  By default it is 50000.
If you set it to -1, you disable the JIT.  If you set it to 0, *all*
code will be JIT-compiled.  The test suite passes at
GUILE_JIT_THRESHOLD=0, indicating that all features in Guile are
supported by the JIT.  Set the GUILE_JIT_LOG environment variable to 1
or 2 to see JIT progress.
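For example (a sketch; "your-program.scm" stands for any Scheme file
you want to test):

```shell
# Disable the JIT entirely:
GUILE_JIT_THRESHOLD=-1 guile your-program.scm

# JIT-compile everything, logging compilation progress to stderr:
GUILE_JIT_THRESHOLD=0 GUILE_JIT_LOG=2 guile your-program.scm

# Lower the threshold so code tiers up sooner than the default 50000:
GUILE_JIT_THRESHOLD=1000 guile your-program.scm
```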

For debugging (single-stepping, tracing, breakpoints), Guile will fall
back to the bytecode interpreter (the VM), for the thread that has
debugging enabled.  Once debugging is no longer enabled (no more hooks
active), that thread can return to JIT-compiled code.
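For instance, at the REPL, the standard ,trace meta-command installs
hooks, so the traced call runs on the bytecode interpreter; once the
trace finishes and no hooks remain, later calls on that thread are
again eligible for JIT-compiled code (a sketch with a made-up
function):

```scheme
scheme@(guile-user)> (define (square x) (* x x))
scheme@(guile-user)> ,trace (square 7)   ; hooks active: runs on the VM
scheme@(guile-user)> (square 7)          ; hooks gone: JIT-eligible again
```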

Right now the JIT-compiled code exactly replicates what the bytecode
interpreter does: the same stack reads and writes, etc.  There is some
specialization when a bytecode has immediate operands, of course.
However, the choice to do debugging via the bytecode interpreter --
effectively, to always have bytecode around -- will allow machine code
(compiled either just-in-time or ahead-of-time) to do register
allocation.  The JIT will probably do a simple block-local allocation;
an AOT compiler is free to do something smarter.

As far as I can tell, with the default setting of
GUILE_JIT_THRESHOLD=50000, the JIT does not increase startup latency
for any workload, and always increases throughput.  More benchmarking
is needed, though.

Using GNU Lightning has been useful but in the long term I don't think
it's the library that we need, for a few reasons:

  * When Lightning does a JIT compilation, it builds a graph of
    operations, does some minor optimizations, and then emits code.  But
    the graph phase takes time and memory.  I think we need a library
    that just emits code directly.  That would lower the cost of
    JIT and allow us to lower the default GUILE_JIT_THRESHOLD.

  * The register allocation phase in Lightning exists essentially for
    calls.  However, we have a very restricted set of calls that we
    need to do, and we could do the allocation by hand on each
    architecture.  (We don't use CPU call instructions for Scheme
    function calls because we use the VM stack.  We might be able to
    revisit this in the future, but again Lightning is in the way.)
    Doing the allocation by hand would allow a few benefits:

      * Hand allocation would free up more temporary registers.  Right
        now Lightning reserves all registers used as part of the platform
        calling convention; they are unavailable to the JIT.

      * Sometimes when Lightning needs a temporary register, it can
        clobber one that we're using as part of an internal calling
        convention.  I believe this is fixed for x86-64 but I can't be
        sure for other architectures!  See commit

      * We need to do our own register allocation; having Lightning also
        do it is a misfeature.

  * Sometimes we know that we can get better emitted code, but the
    Lightning abstraction doesn't let us do it.  We should allow
    ourselves to punch through that abstraction.

The platform-specific Lightning files basically expose most of the API
we need.  We could consider incrementally punching through lightning.h
to reach those files.  Something to think about for the future.

Finally, as far as performance goes -- we're generally somewhere around
80% faster than 2.2.  Sometimes more, sometimes less, always faster
though AFAIK.  As an example, here's a simple fib.scm:

   $ cat /tmp/fib.scm
   (define (fib n)
     (if (< n 2)
         n
         (+ (fib (- n 1))
            (fib (- n 2)))))

Now let's use eval-in-scheme to print the 35th Fibonacci number.  For
Guile 2.2:

   $ time /opt/guile-2.2/bin/guile -c \
       '(begin (primitive-load "/tmp/fib.scm") (pk (fib 35)))'

   ;;; (14930352)

   real 0m9.610s
   user 0m10.547s
   sys  0m0.040s

But with Guile from the lightning branch, we get:

   $ time /opt/guile/bin/guile -c \
       '(begin (primitive-load "/tmp/fib.scm") (pk (fib 35)))'

   ;;; (14930352)

   real 0m5.299s
   user 0m6.167s
   sys  0m0.064s

Meaning that "eval" in Guile 3 is somewhere around 80% faster than in
Guile 2.2 -- because "eval" is now JIT-compiled.  (Otherwise it's the
same program.)  This improves bootstrap times, though Guile 3's compiler
will generally make more CPS nodes than Guile 2.2 for the same
expression, which takes more time and memory, so the gain isn't as
large overall as the eval speedup alone would suggest.

Incidentally, as a comparison, Guile 2.0 (whose "eval" is slower for
various reasons) takes 70s real time for the same benchmark.  Guile 1.8,
whose eval was written in C, takes 4.536 seconds real time.  It's still
a little faster than Guile 3's eval-in-Scheme, but it's close and we're
catching up :)

I have also tested with ecraven's r7rs-benchmarks and we make a nice
jump past the 2.2 results, but we are not yet at Racket or Chez
levels.  I think we need to tighten up our emitted code.  There's
another 2x of performance that we should be able to get with
incremental improvements.  For the last bit we will need global
register allocation, though, I think.

I think I'm ready to merge to "master"; the work is currently in the
"lightning" branch.  You can disable the JIT by passing --disable-jit
to configure.  Tests welcome from non-x86-64 architectures.
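To try it out (a sketch; assumes the usual Guile build prerequisites,
with the branch name from above):

```shell
git clone https://git.savannah.gnu.org/git/guile.git
cd guile
git checkout lightning
./autogen.sh
./configure              # or: ./configure --disable-jit
make -j"$(nproc)"
meta/guile -c '(display "hello from the lightning branch\n")'
```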

Happy hacking,

