[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Chicken-hackers] [PATCH] More efficient numbers integration through "sc
[Chicken-hackers] [PATCH] More efficient numbers integration through "scratchspace"
Sun, 19 Apr 2015 21:01:56 +0200
[warning: this mail is rather long]
I managed to drastically reduce the overhead of the full numeric
tower in core by introducing a new memory allocation area: the
"scratch space". As the name suggests, this is an area where
objects are temporarily stored until they can be re-integrated
in the normal GC. Find it in the "numbers-integration-scratchspace"
branch. This branch builds upon my earlier "numbers-integration"
branch. I have been rebasing like crazy on both branches, so you
will need to reset either branch if you've checked it out before.
This work took a lot of energy from me, and while there's still
some more stuff that could be done, I can't/don't want to continue
with it right now, and I certainly can't justify putting any more
energy in it if it isn't going to be merged. Having said that,
I'm quite happy with the current state of the project. I think
it is ready to be merged in: the remaining tasks, should we decide
to do them, are just minor improvements. More on that later, but
let's first explain how it works!
The goal is that if you're not using extended numbers, you should not
need to pay a large overhead cost because of the simple fact that they
exist (thank you Felix, for pushing this idea so strongly!). So, the
functions for addition, subtraction, multiplication and the bitwise
operators are now inlineable once again. This means that they use
a fixed amount of memory set aside on the stack, exactly as in the
master branch (except the amounts are a larger, in extreme cases
even 10 times as large - going from 4 words to 40 words). When you
first create a bignum, a scratch space is created through "malloc",
and the bignum is placed there.
The scratch area is seen as an extension of the nursery: the
C_demand() and C_stack_probe() checks will add the size of the scratch
space to the stack's size, and trigger a minor GC when the sum of the
two exceeds the maximum nursery size.
Because a new memory region needs to be integrated with the garbage
collector, we have to deal with two main challenges, and this is where
the nitty-gritty details get pretty tricky, even though the main idea
is very simple:
1) When performing heavy calculations, a lot of garbage numbers are
generated, and we need to somehow know which ones are still "live"
and which can be cleaned up, so we won't run out of memory. However,
in an inline function we don't have a continuation/closure root object
from which we can trace the live variables.
2) If the scratch space is filled up and needs to be reallocated, we
need to copy the numbers to a newly allocated area, and update all
the (live) pointers to those numbers. Unfortunately, an object is
referred to by a machine word, and there can be a semi-infinite number
of references to a bignum, which all need to be updated. Because
we're not performing a "proper" GC, we need to somehow update all of
This has been solved by adding a level of indirection: bignums are now
represented exactly like in the "numbers" egg, as a record that contains
two slots: the type (##sys#bignum) and a string that holds the binary
representation of the bignum's digit words. When a bignum is allocated,
the record itself is _always_ allocated on the stack, and the string is
allocated in the scratch space. The memory location of the slot
containing the string is recorded in scratch space, in a machine word
just preceding the string. This "back pointer" functions in a way that
looks a lot like the mutation stack: it remembers what slots need to be
looked at when reallocating. Because numbers are "atomic" in Scheme, we
can also assume that nothing else is going to mutate this slot.
Therefore, whenever we need to reallocate the scratch space, we can
simply move the bignum's digits string, then look at the word preceding
it and interpret it as a pointer to the slot. Then we can check whether
the slot is still pointing back to the bignum (it should be!), and update
it to the string's new memory location.
To visualise this, the memory layout of the scratch space looks like this:
LO | size_0 | slot_0 | ..digits.. | ... | size_n | slot_n | ..digits.. | HI
If we take "LO" to be the starting address of the scratch space and look at
the first word that's stored there, we first see a "size_0" which indicates
how many words the first object uses. If we calculate (LO + size_0 + 2),
we reach the location of size_1, and so on, until we reach the address of
size_n. If we add size_n + 2 to the address of size_n, we go beyond "HI",
which is the scratch space's upper bound (stored in C_scratchspace_limit).
When bignum X must be "freed" (which is done manually in the algorithms
written in C), we can simply update slot_X to NULL. This way, when the
next reallocation of the scratch space occurs, we can simply move over
this object and skip copying it to the new scratch space.
We also need to do some bookkeeping so that when a function returns,
it will transfer all its "live" objects to buffers allocated by the
caller, and "free" any objects that are no longer needed, even if the
next function's local data would overwrite the stack anyway. My initial
implementation didn't do that, but I quickly ran into trouble, because
sometimes another machine word may end up exactly on the place where a
bignum lived in the previous function, and the object had the same bit
pattern as the bignum, so on realloc the system thought it could replace
the machine word with the new location of the stale bignum.
So, to move objects around, there's a "C_migrate_buffer_object" function
which receives the boundaries of the local buffer and an allocation
pointer. It scans for objects still living in the local buffer and
moves those to the allocation pointer. Any slots pointing to objects
in the scratch area are detected, and the "slot_N" word immediately
preceding the object in the scratch area is updated to point to the
moved object's slot. This is done recursively, so that for example
a cplxnums containing two ratnums, which both contain two bignums will
be correctly updated, regardless of where any of the 7 objects lives.
To keep things manageable, C_migrate_buffer_object also accepts a NULL
allocation pointer to indicate that any objects in scratch space must
When a minor GC occurs, we simply do as always: simply start from the
live root continuation and copy every reachable object (including objects
living in the scratch space). Afterwards, the scratch space can be
freed, because it is guaranteed to only contain objects referred to
from the nursery (due to the fact that Scheme numbers are atomic, and
only the wrapper objects can point to the digits). The nursery is
completely empty after a minor GC.
The nice thing about this approach is that it is fully generic: any kind
of object can be stored in scratch space, as long as it consists of a
fixed size "wrapper" object with an arbitrarily large inner object that
cannot be referred to directly from Scheme, and cannot contain any
cycles. For example, srfi-4 vectors could be allocated in scratch space,
because they are constructed similarly: they are also represented as a
record containing a type tag and a string/blob as internal data storage.
This might speed up programs that create many srfi-4 vectors, because
their allocation can now be inlined. Other object types would probably
need to get a change of representation, so that's probably not worth
As you can see in the attached benchmark results, the performance impact
on non-bignum code is only very small, in most cases nil. The benchmarks
that are slower than plain CHICKEN 5 are due to the fact that a few
numeric operations are currently not inlineable: generic division, sqrt
and the trigonometric operators (fft, nbody, nucleic2(?)). In most cases,
it seems to simply be due to the fact that the operators now pre-allocate
more memory (puzzle, destructive(?), sboyer, div-iter, binarytrees, hanoi).
The latter may be fixable, but it will be tricky and complicate the code
even more: we could move the components of "extended" numeric types into
scratch space as well, so that every numeric operator will always only
put the outermost "wrapper" structure on the stack. The former is kind of
inherent, I think. If we make "/" inlineable and improve the scrutinizer
this effect might be reduced.
Unfortunately, fully generic division is quite ugly, especially if
translated to C. With the gymnastics required to keep the scratch space
objects in the right buffer and free them, it gets even worse. It can
certainly be done, but it didn't seem very important, especially since
"quotient", "remainder" and "modulo" _are_ inlineable. If you want to
see how to do this, take a look at how I translated multiplication,
addition and subtraction to C. Basically, you need to start with
converting the dyadic "##sys#/-2" operation and then build the variadic
one on top of that. Other things that could be made inlineable are
"round", "floor" and "ceil", "signum", "numerator" and "denominator"
and so on. However, we should decide where to draw the line, because
moving more to C won't benefit the readability of the core code.
Actually, making "*" inlineable is already a dubious choice, given the
fact that this didn't improve performance on our small benchmarks at all.
However, it cleaned up some of the other calculations in C very much, so
it's a useful thing to have.
Description: Text document
Description: Digital signature
Re: [Chicken-hackers] [PATCH] More efficient numbers integration through "scratchspace", Felix Winkelmann, 2015/04/21
- [Chicken-hackers] [PATCH] More efficient numbers integration through "scratchspace",
Peter Bex <=