No-threading poke, VM performance

poke-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
No-threading poke, VM performance

From:	Luca Saiu
Subject:	No-threading poke, VM performance
Date:	Mon, 21 Dec 2020 14:35:20 +0100
User-agent:	Gnus (Gnus v5.13), GNU Emacs 27.0.50, x86_64-pc-linux-gnu
Hello José, and the others.

José told me that poke's performance in general could be improved.  As
requested by him I am sending this writeup recording our IRC
conversation, and more.  I apologise if by not proposing to immediately
solve these problems myself I will make my proposals feel preachy.  I
think they are interesting anyway for VM writers.


I would like some very small microbenchmark as well; for the time being
I will be using something like this:

  var euclid
    = lambda (long a, long b) long :
        { while (a != b)
            if (a < b)
              b -= a;
            else
              a -= b;
            return a; }
  .vm disassemble function euclid
  .vm disassemble function/n euclid

There are a few obvious problems:

a) pushvar relies on C function calls.

This should be trivial to optimise by turning the function into a macro
and specialising for common instruction arguments.  The same holds for
popvar, and probably related instructions such as pushe.

I am not worried about this.

b1) Arithmetic is *extremely* expensive in the common case on the PVM,
    as it involves GC help allocation of results, and more.

I would suggest changing the data representation by following Lisp's
idea of fixnums: representing unboxed integers which are small enough to
fit in a word, along with their tag bits.  If, for example, you needed
10 bits for tags (I am being very pessimistic here: in practice you will
need fewer) then, on a 64-bit machine, you could represent with a
"long-unboxed" tag any 64-bit number fitting sign-extended in 54 bits.
Your arithmetic operations will involve a conditional, fast-branching to
uncommonly executed code in case a result does not fit.

I would also recommend handling out-of-line any case where the
*argument* is not unboxed: all that is non-critical code, which will be
executed rarely and will have no effect on performance.  You can do all
of that in C functions, and the common case using CPP macros only.

I also see much more bit twiddling than I was expecting.  This is a long
subtraction on the PVM:

# subl (77 B):
    0x00007fcbc92214e0 49 8b 45 00              movq   0x0(%r13),%rax
    0x00007fcbc92214e4 48 83 e0 f8              andq   $0xfffffffffffffff8,%rax
    0x00007fcbc92214e8 48 8b 48 08              movq   0x8(%rax),%rcx
    0x00007fcbc92214ec 8d 71 01                 leal   0x1(%rcx),%esi
    0x00007fcbc92214ef 48 8b 38                 movq   (%rax),%rdi
    0x00007fcbc92214f2 ba 3f 00 00 00           movl   $0x3f,%edx
    0x00007fcbc92214f7 89 d0                    movl   %edx,%eax
    0x00007fcbc92214f9 29 c8                    subl   %ecx,%eax
    0x00007fcbc92214fb 89 c1                    movl   %eax,%ecx
    0x00007fcbc92214fd 48 d3 e7                 shlq   %cl,%rdi
    0x00007fcbc9221500 48 d3 ff                 sarq   %cl,%rdi
    0x00007fcbc9221503 4c 89 f1                 movq   %r14,%rcx
    0x00007fcbc9221506 48 83 e1 f8              andq   $0xfffffffffffffff8,%rcx
    0x00007fcbc922150a 48 8b 01                 movq   (%rcx),%rax
    0x00007fcbc922150d 2b 51 08                 subl   0x8(%rcx),%edx
    0x00007fcbc9221510 89 d1                    movl   %edx,%ecx
    0x00007fcbc9221512 48 d3 e0                 shlq   %cl,%rax
    0x00007fcbc9221515 48 d3 f8                 sarq   %cl,%rax
    0x00007fcbc9221518 48 29 c7                 subq   %rax,%rdi
    0x00007fcbc922151b 48 8b 44 24 50           movq   0x50(%rsp),%rax
    0x00007fcbc9221520 ff d0                    callq  *%rax
    0x00007fcbc9221522 4d 89 75 08              movq   %r14,0x8(%r13)
    0x00007fcbc9221526 49 83 c5 08              addq   $0x8,%r13
    0x00007fcbc922152a 49 89 c6                 movq   %rax,%r14

Even ignoring the allocation near the end, this really seems too much.

This is JitterLisp's subtraction.  I believe your code should end up
not bigger than it:

# primitive-primordial-minus/fR 0x560af169f860 (45 B):
    0x00007f853fc7e085 49 f7 06 0f 00 00 00     testq  $0xf,(%r14)
    0x00007f853fc7e08c 0f 85 2e 00 00 00        jne    0x00007f853fc7e0c0
    0x00007f853fc7e092 48 f7 c1 0f 00 00 00     testq  $0xf,%rcx
    0x00007f853fc7e099 0f 85 21 00 00 00        jne    0x00007f853fc7e0c0
    0x00007f853fc7e09f 49 8b 06                 movq   (%r14),%rax
    0x00007f853fc7e0a2 48 29 c8                 subq   %rcx,%rax
    0x00007f853fc7e0a5 0f 80 15 00 00 00        jo     0x00007f853fc7e0c0
    0x00007f853fc7e0ab 48 89 c1                 movq   %rax,%rcx
    0x00007f853fc7e0ae 49 83 ee 08              subq   $0x8,%r14

Before saying that this code is also too complex for a subtraction
remember that this is a dynamically typed language, and look at what it
does:

Tag checking of the first argument, read from memory from the stack
under-top: branch out-of-line if the argument is not a fixnum:
    0x00007f853fc7e085 49 f7 06 0f 00 00 00     testq  $0xf,(%r14)
    0x00007f853fc7e08c 0f 85 2e 00 00 00        jne    0x00007f853fc7e0c0

Same thing for the second argument, on the top of the TOS-optimised
stack and therefore in a register:
    0x00007f853fc7e092 48 f7 c1 0f 00 00 00     testq  $0xf,%rcx
    0x00007f853fc7e099 0f 85 21 00 00 00        jne    0x00007f853fc7e0c0

Subtraction *with overflow checking*: branch out-of-line on overflow:
    0x00007f853fc7e09f 49 8b 06                 movq   (%r14),%rax
    0x00007f853fc7e0a2 48 29 c8                 subq   %rcx,%rax
    0x00007f853fc7e0a5 0f 80 15 00 00 00        jo     0x00007f853fc7e0c0

If we did not overflow copy the subtraction result to its place, which is
the TOS:
    0x00007f853fc7e0ab 48 89 c1                 movq   %rax,%rcx

Adjust the stack, which is now one element shorter.  This is a nip operation:
    0x00007f853fc7e0ae 49 83 ee 08              subq   $0x8,%r14

The complexity of tagging and operations on tagged data is in CPP macros.

Now, my fixnum representation is clever in the sense that it uses the
tag 0 for fixnums in the low bits (currently three bits: I could afford at
least four on 64-bits platforms), which allows sum and subtraction to
work on *tagged* data.
A tag valued zero is also faster to check, compared to any other value.
The one type you choose to tag that way will be faster to handle.

Compare, for example, car:

# primitive-car/fR 0x560af16a0780 (20 B):
    0x00007f853fc7d3f5 48 8d 41 fe              leaq   -0x2(%rcx),%rax
    0x00007f853fc7d3f9 48 a9 07 00 00 00        testq  $0x7,%rax
    0x00007f853fc7d3ff 0f 85 12 00 00 00        jne    0x00007f853fc7d417
    0x00007f853fc7d405 48 8b 49 fe              movq   -0x2(%rcx),%rcx

In JitterLisp car does not work on nil.  Here leaq is just a way to
compute a three-operand subtraction on x86_64.  In practice in order to
check whether the argument is a cons I subtract the desired tag (0x2) from
the tagged pointer, and test if the result's tag bits are zero.  This
check costs one more instruction because the tag is non-zero.
Then branch away on non-cons.
(And then, finally, do the actual work: load, using an offset of -0x2 to
compensate for the tag.)

If you adopt this style you will have the same asymmetry: the zero tag
will be slightly faster to handle than the others.  There will be *one*
integer type preferred for performance.
In practice, even if the distinction between boxed and unboxed numbers
would still be invisible to the Poke user, we are talking about a
dynamic-typing approach.

Which brings me to my last point c).  Before that, a digression on
something which should be easy to improve:

b2) Comparisons materialise their result, which is boxed (!), on the
    stack

Look at the disassembly for
  ltl
  nip2
  bzi $L
, which you can find compiling the euclid function.

This is a typical problem with stack code, which on a Jittery VM can be
solved with rewriting at least in typical cases.

Here one big mistake is making the Boolean result be a *boxed* integer.
And of course, as always in these cases, you are computing two
conditionals at run time instead of one: one for knowing which Boolean
to push, another depending on the Boolean value.

I have no constructive proposals for avoiding this pattern in Poke's
compiler, but rewrites alone might suffice.  You could rewrite to a new
"branch-on-not-less" instruction in this case, if needed even
reproducing the same stack effect possibly by PUSH_UNSPECIFIED.

c) Uncommonly executed code is inline.

There is a lot of code inside PVM instruction following the pattern

  if (ERROR CONDITION)
    PVM_RAISE (...);
  else
    ...

Apart from the problems of PVM_RAISE, which right now is not replicable
because it relies on string literals, this makes VM instructions very
large, with branches which are taken in the common case.  It also
pollutes the instruction cache with unused code, and having large
instructions will be a problem later when we optimise deeper stack
accesses, which will rely on more replication.

Not doing this may interfere with Poke's compiler logic.  Let me first
show you the desired code; I will end with a constructive proposal.

JitterLisp does not follow this practice yet, but its VM instructions
are designed for it.  Say that I want to do something with the result of
a subtraction:

     ...push the two subtraction operands on the stack...
     primitive-primordial-minus $slow_path
$after_subtraction:
     ...use the subtraction result...

$slow_path:
     subtraction-in-c
     branch $after_subtraction

The important point here is that the slow path has negligible impact on
performance: it can check for type errors (not needed in Poke) and for
correct but uncommon tags in the input or in the output (needed in
Poke).  This code can be as complicated as we need, and it can be
written in comfortable unrestricted C, using a function.  The slow path,
if there is no failure, will branch back into the common path and that
the consumer of the subtraction result does not need to be aware of
anything special.

This should also happen for safe points (currently "sync" in the PVM).
Look at example-vms/uninspired/examples/signals.vm , which is a
cleaned-up version of an example I originally wrote for you on the
mailing list.

The safe-point VM instruction itself should compile to two or three
hardware instructions, with a non-taken branch:

# safe-point/fR 0x55b2e7ba21f0 (11 B):
    0x00007f5589c7ea11 48 83 7b f0 00           cmpq   $0x0,-0x10(%rbx)
    0x00007f5589c7ea16 0f 85 08 00 00 00        jne    0x00007f5589c7ea24

The code for handling notifications, in case they actually arrive, will
be in the slow path.

Important remark: Jitter's new GC will encourage this style: allocation
is extremely fast in the common path -- at four instructions, cheaper
than a C function call plus return in C, with a non-taken branch.
The slow path will call into C.

     allocate-cons %r0, $slow_path
$after_allocation:
     ...Fill the new cons in %r0...

$slow_path:
     ...some instruction which can afford to be inefficient...
     branch $after_allocation

This style should be pervasive.  Modifying the Poke compiler to provide
for it in every place seems hard; however I have a proposal.

José would it be easy to integrate this hypothetical Jitter feature into
Poke's code generator?

Jittery VM subsections
======================

Right now code is generated for a Jittery VM in a strictly sequential
order.  If I have to compile
  if EXP1 then
    STMT1
  else
    STMT2
  ...

the code will be something like:
      [EXP1]
      branch-on-false $else
      [STMT1]
      branch $end
$else:
      [STMT2]
$end:
      ...

And the code generator has to generate, for example, the instructions
for STMT1 before the instructions for STMT2.

I would like to add a feature similar to Gas subsections (the name for
Jitter should be different)

From the point of view of the compiler the code could also be:
      [EXP1]
      branch-on-false $else
      [STMT1]
      branch $end
  .pushsubsection
$else:
      [STMT2]
  .popsubsection
$end:
      ...

This does not seem to make things easier, but now imagine that the else
branch be very unlikely.  Then this variant becomes attractive (two
taken branches in the else case, no taken branch in the then case):

      [EXP1]
      branch-on-false $else
      [STMT1]
$end:
      ...

$else:
      [STMT2]
      branch $end

This code would be easier to generate if we could afford not to follow
the instruction order at generation time:

      [EXP1]
      branch-on-false $else
      [STMT1]
   .pushsubsection
$else:
      [STMT2]
      branch $end
   .popsubsection
$end:
      ...

In fact I already follow this logic in my compilers, even without
subsections, by making my code generators recursive; but I doubt that
this would work with Poke.  The advantage I imagine here is generating
slow paths at the same time as fast paths.  Could something like this be
hidden in your macro instructions using RAS or some other Poke
mechanism?  If it could then solving the problems in c) would become
much less disruptive.

I plan to provide the "subsection" stack in Jitter.  It should not be
terribly hard.

What do you think?

-- 
Luca Saiu
* My personal web site:  http://ageinghacker.net
* GNU epsilon:           http://www.gnu.org/software/epsilon
* Jitter:                http://ageinghacker.net/projects/jitter

I support everyone's freedom of mocking any opinion or belief, no
matter how deeply held, with open disrespect and the same unrelented
enthusiasm of a toddler who has just learned the word "poo".
signature.asc
Description: PGP signature
[Prev in Thread]
Current Thread
[Next in Thread]
No-threading poke, VM performance, Luca Saiu <=
- Re: No-threading poke, VM performance, Luca Saiu, 2020/12/21
Prev by Date: No-threading Poke, PVM correctness
Next by Date: Re: No-threading poke, VM performance
Previous by thread: No-threading Poke, PVM correctness
Next by thread: Re: No-threading poke, VM performance
Index(es):
- Date
- Thread