[Chicken-users] optimizing mutually recursive loops

chicken-users

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Chicken-users] optimizing mutually recursive loops

From:	Alex Shinn
Subject:	[Chicken-users] optimizing mutually recursive loops
Date:	Mon, 16 Feb 2009 16:32:36 +0900
User-agent:	Gnus/5.11 (Gnus v5.11) Emacs/22.3 (darwin)

When optimizing my Chicken code for speed, I find one of the
most important things is to write the code so that it gets
compiled into a single C function with gotos.  It's often so
important to get this that I end up writing very unnatural
Scheme code, including manually combining several loops into
one.

For a simple example, if you look at the canonical mutual
recursion example:

  (letrec ((myeven? (lambda (x) (if (zero? x) #t (myodd? (- x 1)))))
           (myodd? (lambda (x) (if (zero? x) #f (myeven? (- x 1))))))
    (myeven? n))

this will generate two mutually recursive C functions in the
backend.  Because the overhead of the body of these
functions is negligible, the cost of the function overhead
is relatively high.

One can consider, instead, combining the two procedures into
a single, self-recursive procedure.  We do this by
prepending a label to the argument list, with an integer
value assigned to each of the virtual procedures, and
dispatch accordingly.  The above case becomes something
like:

  (let dispatch ((label 0) (x n))
    (if (eq? label 0)
        (if (zero? x) #t (dispatch 1 (- x 1)))
        (if (zero? x) #f (dispatch 0 (- x 1)))))

Here the label 0 is for myeven?, and 1 is for myodd?.  At
first glance, apart from being much harder to read, this
would seem to be a pessimisation - we're doing the extra
work of a conditional test on each of the loop steps.
However, in the C backend this gets compiled to a single
procedure with self-tail-calls contracted to gotos.

Whenever you find yourself writing ugly code in the name of
speed, you should stop and try to abstract away the
ugliness.  To that end, the let-machine syntax at the end of
this article will automate the process for you.  Given any
letrec expression of literal lambdas, you can simply replace
letrec with let-machine.  In this case:

  (let-machine ((myeven? (lambda (x) (if (zero? x) #t (myodd? (- x 1)))))
                (myodd? (lambda (x) (if (zero? x) #f (myeven? (- x 1))))))
    (myeven? n))

will generate code similar to the manually unified loop
above.

Does it help?  For this example with n=400000000, I get
8.703 seconds for the letrec version compared to 1.671
seconds for the let-machine version - a better than 5x
improvement!

However, this is a toy example where the real work is almost
nil and the function overhead everything.  There's also an
important caveat - the goto contraction can only happen in
nested chains of inlined primitives.  If you call an
arbitrary procedure, it will always generate a full function
call.  That's a big caveat (and learning exactly when the
goto's get generated takes practice).

So are there useful examples of loops involving only
primitives?  A good example of an expensive nested loop is
the matrix multiplication algorithm.  To multiple matrix A x
B, the resulting matrix C can be determined by

   C(i,j) = A[i,:] x B[:,j]

i.e. the dot product of the ith row of A with the jth column
of B.  In Scheme (with matrices as vectors and passing
explicit row/column information):

  (define (matrix-multiply a a-rows a-cols b b-rows b-cols)
    (assert (= a-cols b-rows))
    (let ((c (make-vector (* a-rows b-cols))))
      (let lp1 ((i 0))
        (if (= i a-rows)
            c
            (let lp2 ((j 0))
              (if (= j b-cols)
                  (lp1 (+ i 1))
                  (let lp3 ((k 0) (sum 0))
                    (if (= k a-cols)
                        (begin
                          (vector-set! c (+ (* i a-rows) j) sum)
                          (lp2 (+ j 1)))
                        (lp3 (+ k 1)
                             (+ sum (* (vector-ref a (+ (* i a-cols) k))
                                       (vector-ref b (+ (* k b-cols) j)))))
                      ))))))))

[This would be a little easier to read with do, and a lot
easier to read with foof-loop, but I want to make the
recursive calls explicit.]

The innermost loop lp3 is computing the dot product.  It's
just performing simple arithmetic operations on each step
but the last, so it will be compiled normally with a goto so
long as usual integrations are enabled.

However, the outer loops are also just composed of
primitives, plus calls to the (known) inner loops.  What
would happen if we could flatten them?

It turns out, we _can_ flatten them.  By writing them the
way we did, with explicit calls back to the outer loop at
the end of each loop, we've showed how they are mutually
recursive.  Thus they allow a straightforward conversion to
letrec, and then let-machine:

  (define (fast-matrix-multiply a a-rows a-cols b b-rows b-cols)
    (assert (= a-cols b-rows))
    (let ((c (make-vector (* a-rows b-cols))))
      (let-machine
          ((lp1 (lambda (i) (if (= i a-rows) c (lp2 0))))
           (lp2 (lambda (j) (if (= j b-cols) (lp1 (+ i 1)) (lp3 0 0))))
           (lp3 (lambda (k sum)
                  (if (= k a-cols)
                      (begin
                        (vector-set! c (+ (* i a-rows) j) sum)
                        (lp2 (+ j 1)))
                      (lp3 (+ k 1)
                           (+ sum (* (vector-ref a (+ (* i a-cols) k))
                                     (vector-ref b (+ (* k b-cols) j)))))
                      ))))
        (lp1 0))))

So will this really be faster?  We've actually put more work
into the innermost loop (the most important loop), by
requiring the dispatch check on the label, but the whole
thing now compiles to a single C function, recursing only
with goto.  The result of multiplying two 300x300 matrices
on my machine is 1.057 seconds for the nested version, and
0.714 seconds for the let-machine version, more than a 30%
speed reduction for a real-world procedure where speed is
crucial.

If you look at the timings, all of the difference is
accounted for by minor GC - Cheney on the MTA pushes unknown
procedure calls as stack frames to be collected later,
whereas goto doesn't push a frame.  Consequently, the speed
improvement is less noticeable in other Scheme
implementations, and is sometimes slightly slower.  Try it
out if you like - your mileage may vary.

-- 
Alex


;; let-machine implementation - use riaxpander or syntactic-closures

(define-syntax let-machine
  (er-macro-transformer
   (lambda (expr rename compare)
     (let* ((procs
             (map
              (lambda (x)
                (cond
                 ((not (symbol? (car x)))
                  (error "not a symbol" (car x)))
                 ((not (and (= 3 (length (cadr x)))
                            (eq? 'lambda (caadr x))
                            (list? (cadadr x))))
                  (error "let-machine variables must all be lambdas"))
                 (else
                  (cons (car x) (cdadr x)))))
              (cadr expr)))
            (all-vars (apply lset-union eq? (map cadr procs)))
            (free-vars (cons 'label (append (map car procs) all-vars)))
            (init (caddr expr)))
       `(,(rename 'letrec)
         ((dispatch
           (lambda (label ,@all-vars)
             (letrec-syntax
                 ,(map
                   (lambda (p i)
                     `(,(car p)
                       (syntax-rules ()
                         ((_ ,@(cadr p)) (dispatch ,i ,@all-vars)))))
                   procs
                   (iota (length procs)))
               (,(rename 'cond)
                ,@(map
                   (lambda (p i) `((,(rename 'eq?) label ,i) ,(caddr p)))
                   (drop-right procs 1)
                   (iota (- (length procs) 1)))
                (else ,(caddr (last procs))))))))
         (letrec-syntax
             ,(map
               (lambda (p i)
                 `(,(car p)
                   (syntax-rules ()
                     ((_ ,@(cadr p))
                      (dispatch ,i
                                ,@(map (lambda (v) (if (memq v (cadr p)) v #f))
                                       all-vars))))))
               procs
               (iota (length procs)))
           ,init))))))

[Prev in Thread]

Current Thread

[Next in Thread]

[Chicken-users] optimizing mutually recursive loops, Alex Shinn <=
- Re: [Chicken-users] optimizing mutually recursive loops, Alaric Snell-Pym, 2009/02/16
  - Re: [Chicken-users] optimizing mutually recursive loops, Tobia Conforto, 2009/02/16
    - Re: [Chicken-users] optimizing mutually recursive loops, Alaric Snell-Pym, 2009/02/17
    - Re: [Chicken-users] optimizing mutually recursive loops, Peter Bex, 2009/02/17
    - Re: [Chicken-users] optimizing mutually recursive loops, John Cowan, 2009/02/17
    - Re: [Chicken-users] optimizing mutually recursive loops, felix winkelmann, 2009/02/17
- Re: [Chicken-users] optimizing mutually recursive loops, felix winkelmann, 2009/02/17

Prev by Date: Re: [Chicken-users] SDL egg: SDL_GL_SwapBuffers() not implemented?
Next by Date: Re: [Chicken-users] optimizing mutually recursive loops
Previous by thread: Re: [Chicken-users] newbie: questions about set-finalizer!
Next by thread: Re: [Chicken-users] optimizing mutually recursive loops
Index(es):
- Date
- Thread