Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
From: Eugene Grayver
Subject: Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Date: Tue, 11 Dec 2007 15:41:46 -0800
Please see answers in-line.
Thanks!
________________________
Eric Blossom <address@hidden>
12/11/2007 02:31 PM
To: Eugene Grayver <address@hidden>
cc: address@hidden
Subject: Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
On Tue, Dec 11, 2007 at 10:13:32AM -0800, Eugene Grayver wrote:
> Hello,
>
> We are working on some systems that require high sampling rates. I am
> already using the Intel C++ compiler at the highest optimization ratio,
> but a lot of the blocks are very slow still. It appears that Intel C++
> does not properly vectorize <complex> data type.
General curiosity questions:
Are you using oprofile to measure performance?
I am a bit of a maverick and, for various reasons,
am using a pure C++ environment. I hacked my own 'connect_block'
function (can't wait for v3.2, where these will be part of native gr).
I am measuring the performance using a custom block (gr_throughput)
that simply reports the average number of samples processed per second.
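
For illustration, a measurement of this kind could be sketched roughly as below. This is not the actual gr_throughput block (its source does not appear in this thread); the class name throughput_meter and its methods are hypothetical, and the use of std::chrono assumes a C++11 compiler.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for a throughput-reporting block.
// It only counts samples and divides by elapsed wall-clock time.
class throughput_meter {
    typedef std::chrono::steady_clock clock;
    clock::time_point d_start;   // time the meter was created
    std::uint64_t d_total;       // samples seen so far

public:
    throughput_meter() : d_start(clock::now()), d_total(0) {}

    // Call once per work() invocation with the number of samples processed.
    void add(std::uint64_t nsamples) { d_total += nsamples; }

    std::uint64_t total() const { return d_total; }

    // Average rate for a given sample count and elapsed time in seconds.
    static double rate(std::uint64_t total, double seconds) {
        return seconds > 0.0 ? static_cast<double>(total) / seconds : 0.0;
    }

    // Average samples per second since the meter was created.
    double average_rate() const {
        std::chrono::duration<double> dt = clock::now() - d_start;
        return rate(d_total, dt.count());
    }
};
```

A pass-through block would call add(noutput_items) on each invocation and print average_rate() periodically. A steady (monotonic) clock is used so that system clock adjustments do not distort the average.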
What h/w platform are you running on / tuning for?
The platform is currently Intel Xeon or Core2 Duo.
You're not trying to run your app on a cache-crippled machine like a
Celeron, are you? ;)
No, very high end.
Which blocks are causing you the biggest problem?
I got a 2x improvement on all the filtering blocks,
and about a 40% improvement for the sine/cosine generation blocks
(this includes gr_expj and gr_rotate).
Are your problems caused primarily by lack of CPU cycles, cache
misses or mis-predicted branches?
I am not sure, since I am not at all a software expert
(mostly dsp/comm). My guess is that the SSE instructions are not
being used (or not used to their full extent). Even the 'multiply' block
is VERY slow compared to a vector x vector multiplication in the Intel
library. Some of the gr_blocks process each sample using a separate
function call (e.g. for (n=0; n<noutput_samples; n++) scale(in[n]);).
Replacing this with a single vectorized function call
is much faster.
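
The two styles contrasted above can be sketched as follows. This is a minimal illustration, not code from GNU Radio or IPP: scale_one, scale_per_sample and scale_buffer are hypothetical names, and the second version is a plain loop written so that a compiler has a chance to vectorize it, not a call into any vendor library. Whether it is actually faster depends on the compiler and flags.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<float> cfloat;

// Hypothetical per-sample helper, standing in for the scale() call above.
cfloat scale_one(cfloat x, float k) { return x * k; }

// Style 1: one function call per sample.
void scale_per_sample(const cfloat* in, cfloat* out, std::size_t n, float k) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = scale_one(in[i], k);
}

// Style 2: one call for the whole buffer. The complex samples are viewed
// as 2*n interleaved floats (re, im, re, im, ...), which std::complex
// guarantees is a valid layout, giving a simple loop over plain floats.
void scale_buffer(const cfloat* in, cfloat* out, std::size_t n, float k) {
    static_assert(sizeof(cfloat) == 2 * sizeof(float), "unexpected layout");
    assert(in != nullptr && out != nullptr);
    const float* fin = reinterpret_cast<const float*>(in);
    float* fout = reinterpret_cast<float*>(out);
    for (std::size_t i = 0; i < 2 * n; ++i)
        fout[i] = fin[i] * k;
}
```

Both functions compute the same result; the second simply presents the work as one flat loop over floats, which is the shape that SSE-style vector units (and vendor vector routines) operate on.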
> I have been replacing almost every low level block with a functionally
> equivalent using the Intel performance libraries (IPP). These libraries
> are not GPL, but are free for noncommercial use under Linux ($200
> otherwise). At some point, I would like to contribute our work back to
> gnuradio. Would this fit with the gr philosophy? How should we structure
> the code? (i.e. have a separate set of files, use #defines, or ...)?
>
> Eugene
We would not accept the changes. Part of what we're up to is building
an ever expanding universe of free code. Instead of using the
non-free IPP code, please consider using a free library such as ATLAS,
or help us find and fix performance challenges in a way that doesn't
require non-free code. Also, are you sure that your performance
issues can't be better addressed with an algorithmic change? If
you're using a lot of very low-level blocks (e.g., add, multiply,
etc.) you're probably better off writing a block that aggregates some
of the operations into a single block.
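
As a rough sketch of the aggregation suggested above: chaining a multiply block and an add block means two passes over memory plus an intermediate buffer, while a single aggregated block does the same arithmetic in one pass. The function names below are hypothetical and are not GNU Radio blocks; they only illustrate the idea on plain float buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two separate low-level "blocks", each making its own pass over the data.
void multiply_const_pass(const float* in, float* out, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}

void add_const_pass(const float* in, float* out, std::size_t n, float c) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] + c;
}

// Chained version: needs an intermediate buffer between the two passes.
void chained(const float* in, float* out, std::size_t n, float k, float c) {
    std::vector<float> tmp(n);
    multiply_const_pass(in, tmp.data(), n, k);
    add_const_pass(tmp.data(), out, n, c);
}

// Aggregated version: one pass, no intermediate buffer.
void multiply_add_const(const float* in, float* out, std::size_t n,
                        float k, float c) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * k + c;
}
```

The aggregated loop reads and writes each sample once, so it puts less pressure on the cache and avoids the per-block scheduling overhead. The results match the chained version up to floating-point rounding (a compiler may contract the multiply and add into a single fused operation).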
That's what I expected. We'll try to contribute
the more dsp-centric blocks such as demodulators.