discuss-gnuradio

Re: [Discuss-gnuradio] Re-writing blocks using intel libraries


From: Eric Blossom
Subject: Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Date: Tue, 11 Dec 2007 16:06:49 -0800
User-agent: Mutt/1.5.17 (2007-11-01)

On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
> Please see answers in-line.
> 
> Thanks!

> General curiosity questions:
> 
>   Are you using oprofile to measure performance?
> 
> I am a bit of a maverick, and for various reasons am using a pure C++
> environment.  I hacked my own 'connect_block' function (can't wait for
> v3.2, where these will be part of native gr).

The trunk contains C++ code for connect, hier_block2, etc.  Some of
the pieces that are still missing include C++ support for the USRP
daughterboards, but Johnathan Corgan is working on that now.

> I am measuring the performance using a custom block (gr_throughput)
> that simply reports the average number of samples processed per
> second.
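
For anyone who wants a similar measurement, here is a minimal sketch of a
samples-per-second counter.  It is not the poster's gr_throughput block,
which was not posted; the class name throughput_meter and its methods are
made up for illustration.  A sink block's work() would call add() with the
number of items it consumed.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: counts samples and reports the average rate since
// construction.  Not part of GNU Radio.
class throughput_meter {
public:
    using clock = std::chrono::steady_clock;

    throughput_meter() : d_start(clock::now()), d_samples(0) {}

    // Record that n more samples have been processed.
    void add(std::uint64_t n) { d_samples += n; }

    std::uint64_t samples() const { return d_samples; }

    // Pure rate computation, kept separate so it can be checked without timing.
    static double rate(std::uint64_t samples, double seconds) {
        assert(seconds > 0.0);
        return static_cast<double>(samples) / seconds;
    }

    // Average samples per second since the meter was created.
    double samples_per_second() const {
        const std::chrono::duration<double> dt = clock::now() - d_start;
        return dt.count() > 0.0 ? rate(d_samples, dt.count()) : 0.0;
    }

private:
    clock::time_point d_start;
    std::uint64_t d_samples;
};
```

A steady (monotonic) clock is used so that wall-clock adjustments do not
distort the average.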

>   What h/w platform are you running on / tuning for?
> 
> The platform is currently Intel Xeon or Core2 Duo.
> 
>   You're not trying to run your app on a cache-crippled machine like a
>   Celeron, are you?  ;)
> 
> No, very high end.
> 
>   Which blocks are causing you the biggest problem?
> 
> I got a 2x improvement on all the filtering blocks.

If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
or the gr_fir_filter* blocks?  The FFT ones are _much_ faster, with a
break-even point around 16 taps IIRC.
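
To illustrate why tap count matters, here is a minimal sketch of a
direct-form (time-domain) FIR.  It is not the gr_fir_filter* code; the
function name fir_direct is made up for illustration.  Each output costs
one multiply-accumulate per tap, so the work per output grows linearly
with the filter length, which is the cost an FFT-based filter avoids
for longer filters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical direct-form FIR.  Produces in.size() - taps.size() + 1 outputs,
// i.e. only the outputs for which the full tap window overlaps the input.
std::vector<float> fir_direct(const std::vector<float>& in,
                              const std::vector<float>& taps) {
    assert(!taps.empty() && in.size() >= taps.size());
    const std::size_t ntaps = taps.size();
    const std::size_t nout = in.size() - ntaps + 1;
    std::vector<float> out(nout, 0.0f);
    for (std::size_t i = 0; i < nout; ++i) {
        float acc = 0.0f;
        // ntaps multiply-accumulates for every output sample.
        for (std::size_t k = 0; k < ntaps; ++k)
            acc += taps[k] * in[i + ntaps - 1 - k];
        out[i] = acc;
    }
    return out;
}
```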

> About a 40% improvement for sine/cosine generation blocks.  This
> includes gr_expj, gr_rotate.

No surprise there, and that's a great example of SIMD code that should
be in GNU Radio.

>   Are your problems caused primarily by lack of CPU cycles, cache
>   misses or mis-predicted branches?
> 
> I am not sure, since I am not at all a software expert (mostly dsp/comm).
> My guess is that the SSE instructions are not being used (or not used to
> their full extent).  Even the 'multiply' block is VERY slow compared to a
> vector x vector multiplication in the Intel library.

OK.

> Some of the gr_blocks
> process each sample using a separate function call, e.g.
>
>     for (n = 0; n < noutput_samples; n++)
>         scale(in[n]);
>
> Replacing this with a single vectorized function call is much faster.

OK.
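
For readers following along, the two styles being compared look roughly
like this.  This is a minimal sketch, not code from any GNU Radio block:
scale_one, scale_per_sample and scale_block are hypothetical names, and
the gain argument k is an assumption.  If the per-sample helper cannot be
inlined (for example because it lives in another library), the first form
pays a call per sample and is harder for the compiler to vectorize; the
second hands the whole buffer to one routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-sample helper, standing in for the scale() call in the
// quoted loop.
float scale_one(float x, float k) { return k * x; }

// Per-sample style: one function call for every input item.
void scale_per_sample(const float* in, float* out, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = scale_one(in[i], k);
}

// Block style: a single call that walks the whole buffer in a tight loop,
// which a compiler or a hand-written SIMD routine can process several
// samples at a time.
void scale_block(const float* in, float* out, std::size_t n, float k) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = k * in[i];
}
```

Both produce identical output; only the call structure differs.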

> > We would not accept the changes.

> That's what I expected.  We'll try to contribute the more dsp-centric 
> blocks such as demodulators. 

That would be great!  Or if you want to code up an SSE Taylor series
expansion for sine/cosine good to 23-bits or so, we'd love that too ;)
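
As a starting point, here is a scalar sketch of the polynomial such a
routine could use.  It is plain C++ rather than SSE intrinsics, and the
names sin_poly, fast_sin, fast_cos and fast_sincos_block are made up for
illustration.  It assumes the phase has already been wrapped to
[-pi, pi].  The degree-11 Taylor polynomial on [-pi/2, pi/2] has a
truncation error below (pi/2)^13 / 13!, about 5.7e-8; single-precision
rounding adds a little on top of that.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

static const float kPi = 3.14159265358979f;
static const float kHalfPi = 0.5f * kPi;

// Degree-11 Taylor polynomial for sin(r), intended for |r| <= pi/2,
// evaluated with Horner's rule in r^2.
inline float sin_poly(float r) {
    const float r2 = r * r;
    float p = -1.0f / 39916800.0f;   // -1/11!
    p = p * r2 + 1.0f / 362880.0f;   // +1/9!
    p = p * r2 - 1.0f / 5040.0f;     // -1/7!
    p = p * r2 + 1.0f / 120.0f;      // +1/5!
    p = p * r2 - 1.0f / 6.0f;        // -1/3!
    return r + r * r2 * p;           // r - r^3/3! + r^5/5! - ...
}

// sin(x) for x in [-pi, pi]: fold into [-pi/2, pi/2] via sin(pi - x) = sin(x).
inline float fast_sin(float x) {
    if (x > kHalfPi)
        x = kPi - x;
    else if (x < -kHalfPi)
        x = -kPi - x;
    return sin_poly(x);
}

// cos(x) for x in [-pi, pi]: cos(x) = sin(pi/2 - |x|), already in range.
inline float fast_cos(float x) {
    return sin_poly(kHalfPi - std::fabs(x));
}

// Whole-buffer form, the shape a SIMD version would take.
void fast_sincos_block(const float* phase, float* s, float* c, std::size_t n) {
    assert(phase != nullptr && s != nullptr && c != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        s[i] = fast_sin(phase[i]);
        c[i] = fast_cos(phase[i]);
    }
}
```

In an SSE version the two folding branches would become compare-and-mask
operations so that four phases can be handled per instruction.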

Thanks for telling us about your experience.

Eric



