
Re: Proposal: block-based vector allocator


From: Stefan Monnier
Subject: Re: Proposal: block-based vector allocator
Date: Tue, 06 Dec 2011 14:39:09 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux)

> The most basic idea was to maintain separate vector blocks for vectors
> of each possible size (with room for 0, 1, 2, etc. Lisp_Objects -
> let's call them order0, order1 and so on), up to 4x1K vectors in 4K blocks,
> and a dedicated list for >1K vectors. This scheme is simple, but it tends to
> hold too much space in free lists, and that space is of the 'wrong order'.
> It was typical to have 1000 free slots in order[234] blocks, but no free
> space in order50 blocks.

Are you saying that you kept separate blocks for 7-slot vectors and
8-slot vectors?  Yes, that tends to lead to too many blocks.
Instead, each block should cover a range of sizes.

Example for a 32bit system: since vectors, like all objects, need to be
aligned on a multiple of 8, and have at least one metadata field (the
vector's size and gcmarkbit), the minimum sizes of tiny vectors are:

   0&1-slot:   8bytes.
   2&3-slots: 16bytes.
   4&5-slots: 24bytes.
   6&7-slots: 32bytes.

[If we keep the "next" field (even tho it's not needed for those
vectors if we don't split&coalesce), it shifts things by 1, of course]
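
To make the rounding concrete, here's a rough sketch (made-up names, not
the actual Emacs code), assuming a 4-byte Lisp_Object and one 4-byte
header word:

    #include <stddef.h>

    #define WORD_SIZE    4   /* sizeof (Lisp_Object) on a 32bit build */
    #define VECTOR_ALIGN 8   /* objects live on multiples of 8 bytes  */

    /* Bytes occupied by an NSLOTS-slot vector: one header word
       (size + gcmarkbit) plus the slots, rounded up to the alignment.  */
    static size_t
    vec_nbytes (size_t nslots)
    {
      size_t raw = WORD_SIZE + nslots * WORD_SIZE;
      return (raw + VECTOR_ALIGN - 1) & ~(size_t) (VECTOR_ALIGN - 1);
    }

    /* vec_nbytes (0) == vec_nbytes (1) == 8,
       vec_nbytes (2) == vec_nbytes (3) == 16, and so on.  */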

The best we can do (with 4KB blocks) without splitting/coalescing is:
- order16 : 256 objects of size <= 16bytes.
- order24 : 170 objects of size <= 24bytes.
- order32 : 128 objects of size <= 32bytes.
- order40 : 102 objects of size <= 40bytes.
- order48 :  85 objects of size <= 48bytes.
- order56 :  73 objects of size <= 56bytes.
- order64 :  64 objects of size <= 64bytes.
- order72 :  56 objects of size <= 72bytes.
- order80 :  51 objects of size <= 80bytes.
- order88 :  46 objects of size <= 88bytes.
- order96 :  42 objects of size <= 96bytes.
- order104:  39 objects of size <= 104bytes.
- order112:  36 objects of size <= 112bytes.
- order120:  34 objects of size <= 120bytes.
- order128:  32 objects of size <= 128bytes.
- order136:  30 objects of size <= 136bytes.
- order144:  28 objects of size <= 144bytes.
- order152:  26 objects of size <= 152bytes.
- order160:  25 objects of size <= 160bytes.
- order168:  24 objects of size <= 168bytes.
- order176:  23 objects of size <= 176bytes.
- order184:  22 objects of size <= 184bytes.
- order192:  21 objects of size <= 192bytes.
- order200:  20 objects of size <= 200bytes.
- order208:  19 objects of size <= 208bytes.
(skip order216:  18 objects of size <= 216bytes)
- order224:  18 objects of size <= 224bytes.
(skip order232:  17 objects of size <= 232bytes)
- order240:  17 objects of size <= 240bytes.
- order256:  16 objects of size <= 256bytes.
- order272:  15 objects of size <= 272bytes.
- order288:  14 objects of size <= 288bytes.
- order312:  13 objects of size <= 312bytes.
- order336:  12 objects of size <= 336bytes.
- order368:  11 objects of size <= 368bytes.
- order408:  10 objects of size <= 408bytes.
- order448:   9 objects of size <= 448bytes.
- order512:   8 objects of size <= 512bytes.
- order584:   7 objects of size <= 584bytes.
- order680:   6 objects of size <= 680bytes.
- order816:   5 objects of size <= 816bytes.
- order1024:  4 objects of size <= 1024bytes.
- order1360:  3 objects of size <= 1360bytes.
- order2048:  2 objects of size <= 2048bytes.
- order4096:  1 object  of size <= 4096bytes.

I.e. 43 different orders.  As objects grow, so does the wasted space,
e.g. for order512 the max waste (56 bytes, for a 456bytes object) is
already 12%, so we'd set a limit (e.g. 200bytes) after which we'd use
the large vector code.
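
FWIW, the table above (and the waste figures) can be derived
mechanically with something like this rough sketch (not actual code;
it starts at 16-byte chunks to match the table):

    #include <stdio.h>

    #define BLOCK_SIZE 4096
    #define ALIGN 8

    int
    main (void)
    {
      size_t prev = 0;
      /* One order per possible number of chunks in a block.  */
      for (size_t count = BLOCK_SIZE / 16; count >= 1; count--)
        {
          size_t order = (BLOCK_SIZE / count) & ~(size_t) (ALIGN - 1);
          if (order == prev)
            continue;          /* Same chunk size as before: no new order.  */
          /* The smallest object landing in this order is PREV + ALIGN,
             so the worst case waste per object is ORDER minus that.  */
          size_t smallest = prev + ALIGN;
          size_t waste = order - smallest;
          printf ("order%zu: %zu objects of size <= %zubytes "
                  "(max waste %zu bytes, %zu%%)\n",
                  order, BLOCK_SIZE / order, order, waste,
                  waste * 100 / smallest);
          prev = order;
        }
      return 0;
    }

Running it should reproduce the list above, including the 12% figure
for order512.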

Is that more or less what you had done?  It's a very standard and
popular malloc algorithm, so I'm surprised you say it didn't work well.

If the problem is that we have many "mostly empty" blocks of the wrong
size, then we can do the following:
- reduce the block size to 1024 (as used for cons cells and floats, BTW).
- impose an alignment on multiples of 16 (probably a good idea anyway
  since it will give us a bit more tag space), which reduces the number
  of orders accordingly.

E.g. for 1024bytes blocks with a "multiple of 16" alignment, we have
far fewer orders, and hence correspondingly fewer "mostly empty" blocks:
- order16 : 64 objects of size <= 16bytes.
- order32 : 32 objects of size <= 32bytes.
- order48 : 21 objects of size <= 48bytes.
- order64 : 16 objects of size <= 64bytes.
- order80 : 12 objects of size <= 80bytes.
- order96 : 10 objects of size <= 96bytes.
- order112:  9 objects of size <= 112bytes.
- order128:  8 objects of size <= 128bytes.
- order144:  7 objects of size <= 144bytes.
- order160:  6 objects of size <= 160bytes.
- order192:  5 objects of size <= 192bytes.
- order256:  4 objects of size <= 256bytes.
- order336:  3 objects of size <= 336bytes.
- order512:  2 objects of size <= 512bytes.
- order1024: 1 object  of size <= 1024bytes.
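
At allocation time, mapping a requested size to its order is then just
a bit of arithmetic; roughly (a sketch with a made-up name, not actual
code):

    /* Chunk size ("order") that a request of NBYTES would be served
       from, given BLOCK_SIZE-byte blocks and ALIGN-byte alignment.
       Assumes 0 < NBYTES <= BLOCK_SIZE; anything bigger goes to the
       large-vector code as before.  */
    static size_t
    order_for_size (size_t nbytes, size_t block_size, size_t align)
    {
      /* Round the request up to the alignment.  */
      nbytes = (nbytes + align - 1) & ~(align - 1);
      /* How many chunks of at least that size fit in one block?  */
      size_t count = block_size / nbytes;
      /* Give each chunk an equal share of the block, rounded down
         to the alignment.  */
      return (block_size / count) & ~(align - 1);
    }

    /* E.g. order_for_size (100, 1024, 16) => 112,
            order_for_size (200, 1024, 16) => 256.  */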
  
> As a result, the total memory consumption was _more_ than current.

I don't think we should care about 0-slot and 1-slot vectors too much,
and even 2-slot vectors are not super important (code that really cares
about it can use a cons cell instead in most cases).  Also, currently,
3-slot vectors take up 24bytes for the Lisp_Vectorlike object plus one
mem_node entry in the red/black tree (which is 7 words, could be brought
down to 6, i.e. 32bytes), for a grand total of 56bytes (that's without
counting malloc's own overhead, tho it seemed to be fairly small last
time I looked at it).

So we should have a fair bit of wiggle room here.  Especially since I'm
not specifically interested in reducing memory use as much as getting
rid of the O(log n) overhead and generally making smallish
"defstruct-style" vectors more efficient.

E.g. the "56byte overhead" corresponds more or less to the overhead of
placing a 456B object in the order512 (for 4096B blocks), so we should
be able to allocate all vectors smaller than 512B in such a scheme
without suffering much increased memory use (other than having some
number of mostly empty blocks).

> block in the corresponding free list. In this case, the proposed allocator
> will not consider the 608- and 616-byte free lists, because splitting gives
> 8- and 16-byte tails, which can't be used to hold any vector (except
> 0-sized, which is very rare). But it will consider the free list with
> 624-byte blocks, because the 24-byte tail can be used for a vector holding
> 1 Lisp_Object.

Regardless of whether we can find a 1-slot vector to fill this, 624bytes
for a 600bytes object is just 4% overhead, so it's not worth wasting too
much effort trying to make use of that space.  It's still much smaller
than the 56bytes (for 32bit systems) of mem_node we currently allocate
for every vector.

> If it fails, the next candidate is the 632-byte block free list, and so
> on. And it's very likely that a 24-byte (or 32, or 40, etc.) tail will
> be used soon, because most vectors are small.

Why have different classes for 632B and 600B sizes?  You can't fit more
than 6 vectors of either size in a 4KB block, so you might as well
divide your "order600" block into 6 chunks of 680B (plus 16B of padding)
rather than 6 chunks of 600B plus an unused 496B of padding.
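
(FWIW, with the order_for_size sketch from earlier,
order_for_size (600, 4096, 8) and order_for_size (632, 4096, 8) both
come out to 680, i.e. the two sizes would naturally end up sharing
a class.)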


        Stefan


