RE: Re : [Bug-gnubg] USE_SSE2

From: Ingo Macherius
Subject: RE: Re : [Bug-gnubg] USE_SSE2
Date: Mon, 17 Aug 2009 20:37:13 +0200

> Hmmm ... so at the moment we have two defines, 

These two #defines switch the pseudo-assembler intrinsics code we wrote
ourselves on and off. The "-msse" switch family makes the intrinsics we use
visible, but in addition they allow the compiler's own backend to generate
SSE instructions from its internal CPU-agnostic tree model.
This is actually their main purpose. The intrinsics we wrote are more or
less only required because of the still relatively poor auto-vectorizer in

> Questions:
> 1. I'm compiling with -msse (and/or -msse2) only the file 
> neuralnetsse.c as I rememebr there where issues compiling all 
> gnubg source files with -msse. Is this still correct ?

Limiting -msse and -msse2 to lib/*sse*.c only is a waste. MMX and
SSE{1,2,3,4} are useful for far more than SIMD floating point operations.
They cover the full range of CPU activity, from cache control over fast SIMD
number crunching to bit fiddling and even OO support in hardware. Different
MMX/SSE versions may also expose additional registers, which always speeds
things up. Any code will benefit from using higher SSE versions, regardless
of what it does. Think of MMX, SSE, SSE2, ... etc. as IA-32 instruction set
version 2.0, 3.0, 4.0 etc. The higher your version, the more options for
fast code the compiler backend and optimizer have.

Here are the numbers of machine language opcodes each extension adds.
Counting is not precise, as addressing modes etc. create flavors of commands.

IA-32 (i386 + i387): dunno :)
MMX: 46 new commands
SSE: 70 new commands
SSE2: 144 new commands
SSE3: 13 new commands

Given more commands the compiler obviously has more opportunity to mess
things up - just as well as it has more opportunities to speed things up.
However, I had no issues using any of the above instruction set blocks with
gcc 3 and 4. Tested on Xeon 3.0 GHz (almost a Xeon Nocona), Xeon Core2 5130,
Pentium 4 Northwood and Pentium Core2 T5750 (got no AMD machine, alas,
feedback welcome).

> 2. If yes, I would try to avoid compiling everything 3 times 
> just because of one file. 

This approach (implemented in the form of lib/neuralnetsse.c) means to limit
intrinsics to few "CPU dependent plugins" and leave the rest as pure i386
compile. While this solution is elegant and used by other software (e.g.
povray and Java do the same trick), it creates a lot of extra work and code,
as you just mentioned. Plus eval.c and friends do not benefit from SSE's
non-SIMD speedups with this approach either. While I see the beauty in this
solution, I think it tries to solve a problem which in reality hardly
exists. How many users actually run 7-year-old machines?
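
For readers who have not seen the "CPU dependent plugin" pattern, here is a
minimal sketch of one common variant, runtime dispatch through a function
pointer. It is not taken from gnubg: the names sum_plain, sum_sse and
select_sum are hypothetical, and __builtin_cpu_supports is a gcc/clang
builtin for x86 that I assume is available. In a real build the SSE variant
would live in its own file and be the only one compiled with -msse; here it
is merely guarded by the predefined __SSE__ macro so that the sketch fits in
one file.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

typedef float (*sum_fn)(const float *v, size_t n);

/* Portable version: what a plain i386 build would run. */
static float sum_plain(const float *v, size_t n)
{
    float s = 0.0f;
    size_t i;
    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

#if defined(__SSE__)
/* SSE version: adds four floats per step, then the leftovers. */
static float sum_sse(const float *v, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    float s;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(v + i));
    _mm_storeu_ps(lanes, acc);
    s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)
        s += v[i];
    return s;
}
#endif

/* Pick an implementation once, at startup (hypothetical helper). */
static sum_fn select_sum(void)
{
#if defined(__SSE__) && defined(__GNUC__) && \
    (defined(__i386__) || defined(__x86_64__))
    if (__builtin_cpu_supports("sse"))
        return sum_sse;
#endif
    return sum_plain;
}
```

A caller would do "sum_fn f = select_sum();" once and use f(v, n) from then
on. Everything outside the plugin stays plain i386 code, which is exactly
the cost I describe above.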

> Ingo also wrote:
> >  =>  "gcc -mfpmath=sse -msse2 -msse" should be a no-brainer for the 
> > binary  distribution.
> Do we need to put both -msse and -msse2 (when we want sse2) ? 
> Cause the current win makefile has only -msse2 ...

Enter the gcc architectures. Note that "-march=pentium4" is just an alias
for "-mmmx -msse -msse2 -mno-sse3" plus some assumptions about caching and
command execution speed for each machine language opcode.
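
If in doubt about what a given set of switches actually enables, gcc can be
asked directly, since it predefines the macros __MMX__, __SSE__, __SSE2__
and __SSE3__ for each enabled instruction set. A small sketch (the function
and enum names are made up for illustration):

```c
#include <stdio.h>

/* Bit flags for the report below (illustrative names only). */
enum { HAS_MMX = 1, HAS_SSE = 2, HAS_SSE2 = 4, HAS_SSE3 = 8 };

/* Returns a bitmask of the instruction sets the compiler was
 * allowed to use when building this translation unit. */
static int enabled_isa_mask(void)
{
    int mask = 0;
#ifdef __MMX__
    mask |= HAS_MMX;
#endif
#ifdef __SSE__
    mask |= HAS_SSE;
#endif
#ifdef __SSE2__
    mask |= HAS_SSE2;
#endif
#ifdef __SSE3__
    mask |= HAS_SSE3;
#endif
    return mask;
}

/* Prints one line such as "MMX:1 SSE:1 SSE2:1 SSE3:0". */
static void print_enabled_isa(void)
{
    int m = enabled_isa_mask();
    printf("MMX:%d SSE:%d SSE2:%d SSE3:%d\n",
           !!(m & HAS_MMX), !!(m & HAS_SSE),
           !!(m & HAS_SSE2), !!(m & HAS_SSE3));
}
```

Call print_enabled_isa() from main and compile the file once with each
candidate set of switches to compare what they turn on.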

My suggestion is to use "-march=pentium4 -mfpmath=sse" (see gcc manpage for
a good survey) for all source files to produce the standard binaries
distributed for Windows. 

A speedup (for me up to 1/3, but I optimize very aggressively) for 95% of all
gnubg users matters more than the incompatibility with old hardware for the
remaining 5%. Yes, I am frigging mean :) In the documentation, it must of
course be stated that a Pentium 4 or AMD Athlon 64 processor or newer is
required. Some testing with a wide variety of machines is needed - and that
is what a test release would achieve. Gnubg IS still 0.9, we can simply do it
and see if the users cry out loud. We also need to do it now; once 1.0 is
released it is too late for that sort of change for a while.


<anecdote read="optional">
One remark on the "-m..." switches: they are useful in a setting I
encountered with the Xeon 3.0 GHz machine. It has a Xeon with "Gallatin"
core, which is based on the Pentium 4 Northwood design. This means it is one
of the rare beasts which still lacks SSE3, but already has the x86-64
extensions. GCC, however, only offers "-march=prescott" (SSE3 and no
x86-64) or "-march=nocona" (both SSE3 and x86-64).

Simple solution for my freak chip: "-march=nocona -mno-sse3". Problem
solved, works like a charm.
</anecdote>

> http://www.gamedev.net/reference/articles/article1987.asp
> in the -masm=intel switch during compile."
> Is it realy necessary ?

Look at the page date, this is 2003 information, i.e. written when SSE2 was
rather new and GCC support for it lousy if existent at all. I do not think
this is relevant anymore. We use intrinsics, not naked assembler like the
page's example does. Intrinsics are a higher level of abstraction and
shouldn't have these low end problems.
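
To illustrate the difference in level, here is a minimal sketch of what
intrinsics code looks like. It is not the code from lib/neuralnetsse.c: the
function name dot_product is made up, and I guard on gcc's predefined
__SSE__ macro rather than on our own #defines so the sketch also compiles
without "-msse". The _mm_* calls are the standard ones from xmmintrin.h.

```c
#include <stddef.h>

#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Dot product of two float arrays of length n (any n >= 0). */
static float dot_product(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
#ifdef __SSE__
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    /* Multiply and accumulate four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    /* Fold the four partial sums into one scalar. */
    _mm_storeu_ps(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar loop: handles the tail, or everything without SSE. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

No register names, no AT&T versus Intel syntax, no "-masm" switch: the
compiler does register allocation and instruction scheduling itself.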

I am not arguing for gcc4's "-ftree-vectorize" and other fancy uses of
SSE for optimization, which in my experience do introduce optimizer
bugs. Our hand-written intrinsics ARE what that optimization ideally should
produce. No need for it, we already have it. I argue for using the good old
gcc3 but allowing it to use the extended SSE1/2 instruction set so it can
benefit from its non-SIMD magic, which *still* has measurable speed
implications.
