[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Bug-gnubg] Benchmarks, experiments, speedups

From: Ingo Macherius
Subject: [Bug-gnubg] Benchmarks, experiments, speedups
Date: Tue, 22 Feb 2005 16:18:10 +0100

I spent some looking into performance improvements for gnubg by "cheap
tricks", i.e. compiler settings and other non-algorithmic ways to speed up

1) Compiler

The SSE and SSE2 extensions to the x86 processor set introduce a limited
SIMD (single instruction multiple date) capability to more recent Pentiums
and AMDs. The bottomline is, that 4 floating point or interger (not bit!)
operation can be done in one cycle under certain circumstances. That simply
makes the loop 4 times faster than in 386 code. As the neural net is the
determining part of the CPU usage and it is not suprising that optimizing
with SSE support gives a definitive edge. Here are results with different
binaries on my machine (P4C 2.4Ghz with SSE2):

386 :  23751 static evaluations/second
SSE :  27871 static evaluations/second
SSE2:  32311 static evaluations/second
SSE3:  n/a (does not work on my hardware, no data)

The processors offering these levels are rougly

386 :  Any 32 Bit x86 bit CPU (386 or better)
SSE :  AMD Athlon XP "Palomino", Intel Pentium III (both since ~ 2001)
SSE2:  Intel Pentium 4A and Xenon, AMD Opteron, AMD-64 (since ~ 2003)
SSE3:  Intel Pentium 4F "Prescott" and Xenon, AMD Opteron 852 and 252

2) Loops

SSE2 vectorizes exactly 4 numbers and thus can only work on loops of a
length divideable by 4. Furthermore, it is required that the exact loop
length is a known at compile time. In neuralnet.c, we have most loops in
this style:

for( i = 0; i < pnn->cHidden; i++ )
        ar[ i ] = pnn->arHiddenThreshold[ i ];

While this is nice and general, it effectively stops the compiler from
vectorizing (using the SIMD instructions) and parallelizing (using
hyperthreading to execute two loops instead of one in parallel). This is
important because these loops happen in the single most hotspot of gnubg,
the neural net EvaluateOld(...) and EvaluateFromBase methods.

Here is a simple idea to improve the matter. For any release of the
networks, their size is fixed. I.e. for weights version 0.15, we have these

250 128 5 1   0.1000000   1.0000000 (inputs, hidden, outputs, ...)
214 128 5 1   0.1000000   1.0000000
250 128 5 1   0.1000000   1.0000000
200 10 5 1268454366   0.1000000   1.0000000
200 5 5 258840464   0.1000000   1.0000000
200 5 5 250675595   0.1000000   1.0000000

If you make a switch statement and subroutine into versions of EvaluateOld
which are set up with fixed length loops, all of it can be vectorized. I
implemented this and found an improvement of some percent, but less than 5%,
with SSE2 binaries. Still not bad.

I could submit the patch, but it is pretty ugly as it requires #include and
template hacks. Anybody in favor of it? If anybody has a better idea how to
guarantee fixed length loops at compile time without templates, let me know.
And there are more code sections to check for SSE2 compliance, a thorough
code review in eval.c may yield some more %.

3) Blas subroutines

In neuralnet.c there is experimental code for libatlas / Blas. Because of
this code in every call of call of Evaluate an uncoditional subroutine call
to EvaluateOld is done. I've tried ATLAS (a hell of a compile!) and found it
to deliver only 16000 eval/s even even when highly optimized. The cost of a
library call seems to simply undo the effect of better Matrix multiplication

May we remove the unconditional jump by simply renaming EvaluateOld and
EvaluateFromBaseOld back to Evaluate and EvaluateFromBase? (small but real
performance improvement). The Blas routines can simply be renamed to

4) GPU based evaluation

Have you ever asked yourself what is the fastest processor in your machine?
In mine it's not the CPU, but the 3d graphic card (GPU). It can do 4x4
matrix operations very fast and massively parallel. The neural net
evaluation is basically composed of 2 vector/sclar multilications (the
sigmoids) and 2 vector/matrix multiplications (the sums).

There is recent research on using GPU as what is called GPUGP (GPU general
purpose). There are some promising results, e.g.
edd5ffeadb84d63e03c claims a 20 times speedup for a GPU based NN. Or this
guy http://www.cs.stevens.edu/~quynh/student-work/acorrigan_gpu.htm has even
some code attached.

My GPU is 2 year old model with 300 MHz and 8 pipelines, that yields 2.4
billion 4x4 matrix multiplications(or scalar multiplications with a 4
element vector) per second. And the CPU is idle meanwhile, working in
parallel ... Breaking down general algebra into an algebra over 4x4 matrices
is a hot research item. And there are two implementations that actually do
it, now. They also hide differences between underlaying hardware details, by
using OpenGL or DirectX as an intermediate layer.


There are some more projects, but these two are the major ones. I've looked
into both and found them to be at alpha stage, but already useable. But it's
still not as easy as a numerical library like blas would be. However, for
the records: this may be a big improvement for gnubg, I'll revisit it in a
few month.

And, as a nice closer, here is a gadget for comparing CPU/CPU+SSEx/GPU


reply via email to

[Prev in Thread] Current Thread [Next in Thread]