Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

From: Eugene Grayver
Subject: Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Date: Tue, 11 Dec 2007 15:41:46 -0800

Please see answers in-line.


Eric Blossom <address@hidden>

12/11/2007 02:31 PM

Eugene Grayver <address@hidden>
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

On Tue, Dec 11, 2007 at 10:13:32AM -0800, Eugene Grayver wrote:
> Hello,
> We are working on some systems that require high sampling rates.  I am
> already using the Intel C++ compiler at the highest optimization level,
> but a lot of the blocks are still very slow.  It appears that the Intel
> C++ compiler does not properly vectorize the <complex> data type.

General curiosity questions:

 Are you using oprofile to measure performance?

I am a bit of a maverick and, for various reasons, am using a pure C++ environment.  I hacked my own 'connect_block' function (can't wait for v3.2, where these will be part of native gr).  I am measuring performance with a custom block (gr_throughput) that simply reports the average number of samples processed per second.
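
For readers who want to try a similar measurement, here is a minimal sketch of a samples-per-second counter. It is not the gr_throughput block mentioned above (its source is not shown in this thread); the class name throughput_meter and its methods are hypothetical, and the sketch assumes a C++11 compiler for std::chrono.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper, NOT the gr_throughput block from this thread.
// It accumulates sample counts and reports the average rate since creation.
class throughput_meter {
public:
    typedef std::chrono::steady_clock clock;

    throughput_meter() : d_start(clock::now()), d_total(0) {}

    // Call once per processing pass with the number of samples handled.
    void add(std::size_t nsamples) { d_total += nsamples; }

    // Total number of samples seen so far.
    std::size_t total() const { return d_total; }

    // Average rate for an explicit sample count and elapsed time in seconds.
    // Returns 0 when no time has elapsed, to avoid dividing by zero.
    static double rate(std::size_t total, double elapsed_s) {
        return elapsed_s > 0.0 ? static_cast<double>(total) / elapsed_s : 0.0;
    }

    // Average samples per second since the meter was constructed.
    double samples_per_second() const {
        std::chrono::duration<double> dt = clock::now() - d_start;
        return rate(d_total, dt.count());
    }

private:
    clock::time_point d_start;  // time at which counting began
    std::size_t d_total;        // samples accumulated so far
};
```

Inside a pass-through block, one would call add(noutput_items) on each processing pass and print samples_per_second() periodically. A number like this shows how fast the whole chain runs, but not where the time goes; a profiler such as oprofile is still needed for that.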

 What h/w platform are you running on / tuning for?

The platform is currently Intel Xeon or Core2 Duo.

 You're not trying to run your app on a cache-crippled machine like a
 Celeron, are you?  ;)

No, very high end.

 Which blocks are causing you the biggest problem?

With the IPP replacements I got a 2x improvement on all the filtering blocks, and about a 40% improvement for the sine/cosine generation blocks, including gr_expj and gr_rotate.

 Are your problems caused primarily by lack of CPU cycles, cache
 misses or mis-predicted branches?

I am not sure, since I am not at all a software expert (mostly dsp/comm).  My guess is that the SSE instructions are not being used (or not used to their full extent).  Even the 'multiply' block is VERY slow compared to a vector x vector multiplication in the Intel library.  Some of the gr blocks process each sample with a separate function call inside a loop of the form

for (n=0; n<noutput_samples; n++)

Replacing such a loop with a single vectorized function call is much faster.

> I have been replacing almost every low level block with a functionally
> equivalent using the intel performance libraries (IPP).  These libraries
> are not GPL, but are free for noncommercial use under Linux ($200
> otherwise).  At some point, I would like to contribute our work back to
> gnuradio.  Would this fit with the gr philosophy?  How should we structure
> the code?  (i.e. have a separate set of files, use #defines, or ...)?
> Eugene

We would not accept the changes.  Part of what we're up to is building
an ever expanding universe of free code.  Instead of using the
non-free IPP code, please consider using a free library such as ATLAS,
or help us find and fix performance challenges in a way that doesn't
require non-free code.  Also, are you sure that your performance
issues can't be better addressed with an algorithmic change?  If
you're using a lot of very low-level blocks (e.g., add, multiply,
etc.) you're probably better off writing a block that aggregates some
of the operations into a single block.
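
As a rough illustration of the aggregation suggested above, the sketch below computes k*x + y in two ways: as two chained low-level operations with an intermediate buffer, and as one fused pass. All names are hypothetical and this is not GNU Radio code. It assumes C++11 and inputs of equal length.

```cpp
#include <cstddef>
#include <vector>

// Two separate low-level operations, each making a full pass over the data.
void multiply_const(const float* in, float* out, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; i++)
        out[i] = k * in[i];
}

void add_vectors(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

// Chained version: the product is written to a temporary buffer and then
// read back by the add, so the data goes through memory twice.
std::vector<float> chained(const std::vector<float>& x,
                           const std::vector<float>& y, float k) {
    std::vector<float> tmp(x.size()), out(x.size());
    multiply_const(x.data(), tmp.data(), x.size(), k);
    add_vectors(tmp.data(), y.data(), out.data(), x.size());
    return out;
}

// Fused version: one pass, no intermediate buffer.
std::vector<float> fused(const std::vector<float>& x,
                         const std::vector<float>& y, float k) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); i++)
        out[i] = k * x[i] + y[i];
    return out;
}
```

In a flow graph, each low-level block also adds scheduling and buffer-management cost on top of the extra memory pass, so fusing operations into one block may save more than this small example suggests.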

That's what I expected.  We'll try to contribute the more dsp-centric blocks such as demodulators.  

