Re: [Discuss-gnuradio] Re-writing blocks using intel libraries


From: Martin Dvh
Subject: Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Date: Wed, 12 Dec 2007 17:03:12 +0100
User-agent: Icedove 1.5.0.14pre (X11/20071018)

Tom Rondeau wrote:
> Martin Dvh wrote:
>> Eric Blossom wrote:
>>  
>>> On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
>>>    
>>>> Please see answers in-line.
>>>>
>>>> Thanks!
>>>>       General curiosity questions:
>>>>
>>>>   Are you using oprofile to measure performance?
>>>>
>>>> I am a bit of a maverick, and for various reasons am using a pure
>>>> C++ environment.  I hacked my own 'connect_block' function (can't
>>>> wait for v3.2, where these will be part of native gr).
>>>>       
>>> The trunk contains C++ code for connect, hier_block2, etc.  Some of
>>> the pieces that are still missing include C++ support for the USRP
>>> daughterboards, but Johnathan Corgan is working on that now.
>>>
>>>    
>>>> I am measuring the performance using a custom block (gr_throughput)
>>>> that simply reports the average number of samples processed per
>>>> second.
>>>>         What h/w platform are you running on / tuning for?
>>>>
>>>> The platform is currently Intel Xeon or Core2 Duo.
>>>>
>>>>   You're not trying to run your app on a cache-crippled machine like a
>>>>   Celeron, are you?  ;)
>>>>
>>>> No, very high end.
>>>>
>>>>   Which blocks are causing you the biggest problem?
>>>>
>>>> I got a 2x improvement on all the filtering blocks.
>>>>       
>>> If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
>>> or the gr_fir_filter* blocks?  The FFT ones are _much_ faster with a
>>> break-even point around 16 taps IIRC.
>>>
>>>    
>>>> About a 40% improvement for sine/cosine generation blocks.  This
>>>> includes gr_expj, gr_rotate.
>>>>       
>>> No surprise there, and that's a great example of SIMD code that should
>>> be in GNU Radio.
>>>
>>>    
>>>>   Are your problems caused primarily by lack of CPU cycles, cache
>>>>   misses or mis-predicted branches?
>>>>
>>>> I am not sure, since I am not at all a software expert (mostly
>>>> dsp/comm). My guess is that the SSE instructions are not being used
>>>> (or not used to a full extent).  Even the 'multiply' block is VERY
>>>> slow compared to a vector x vector multiplication in the Intel library.
>>>>       
>>> OK.
>>>
>>>    
>>>> Some of the gr_blocks process each sample using a separate function
>>>> call, e.g.
>>>>         for (n=0; n<noutput_samples; n++)
>>>>             scale(in[n]);
>>>>
>>>> Replacing this with a single vectorized function call is much faster.
>>>>       
>>> OK.
>>>
>>>    
>>>>> We would not accept the changes.
>>>>>         
>>>> That's what I expected.  We'll try to contribute the more
>>>> dsp-centric blocks such as demodulators.       
>>> That would be great!  Or if you want to code up an SSE Taylor series
>>> expansion for sine/cosine good to 23-bits or so, we'd love that too ;)
>>>     
>> I am working on this in the little spare time I have.
>> I already got an SSE Taylor series for atan2 working in gnuradio.
>> The atan2 needs some code cleanup and wrapper code to switch
>> implementations (if the processor is x86 and supports SSE2 => optimized,
>> else generic).
>> The sin/cos is far from ready.
>>
>> Greetings,
>> Martin
>>   
> 
> Martin,
> 
> Bob put in a fast atan function (general/gr_fast_atan2f.cc) about a year
> ago. Have you looked at this, and is the Taylor performance better?
The Taylor performance is much better when you get (a multiple of) 4 atan2s at 
a time, because the SSE Taylor series works with SIMD in blocks of 4.
When you only get one at a time, the performance is still better, but not by 
much.
The Taylor series is also more precise than gr_fast_atan2f.cc.
I don't have the numbers at hand, but I also wrote QA and benchmark code, so 
exact numbers on precision and speed can be determined.
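
To illustrate the blocks-of-4 structure only (this is NOT the actual gnuradio 
code): a rough sketch in plain C++, where a 4-lane loop stands in for the SSE 
intrinsics so it compiles anywhere. The names atan_taylor_unit, atan2_x4 and 
atan2_block are made up for this example, and the half-angle range reduction 
is my assumption about how to make a short Taylor series converge. Leftover 
samples that do not fill a block of 4 fall back to std::atan2.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from gnuradio): atan(z) for 0 <= z <= 1.
// The half-angle identity atan(z) = 2*atan(z / (1 + sqrt(1 + z*z)))
// shrinks the argument to at most tan(pi/8), where a short Taylor
// series converges quickly.
static inline float atan_taylor_unit(float z) {
    const float w = z / (1.0f + std::sqrt(1.0f + z * z));
    const float w2 = w * w;
    // w - w^3/3 + w^5/5 - w^7/7 + w^9/9, written in Horner form.
    const float p = w * (1.0f + w2 * (-1.0f / 3.0f + w2 * (1.0f / 5.0f
                  + w2 * (-1.0f / 7.0f + w2 * (1.0f / 9.0f)))));
    return 2.0f * p;
}

// Four atan2 results at once. Every lane runs the same operations,
// which is the shape a 4-wide SIMD implementation needs.
static inline void atan2_x4(const float* y, const float* x, float* out) {
    const float pi = 3.14159265358979f;
    for (int lane = 0; lane < 4; ++lane) {
        const float ax = std::fabs(x[lane]);
        const float ay = std::fabs(y[lane]);
        const float mx = std::max(ax, ay);
        const float mn = std::min(ax, ay);
        const float z = (mx > 0.0f) ? mn / mx : 0.0f;  // ratio in [0, 1]
        float r = atan_taylor_unit(z);                 // angle in [0, pi/4]
        if (ay > ax) r = 0.5f * pi - r;                // fold back to [0, pi/2]
        if (x[lane] < 0.0f) r = pi - r;                // left half plane
        if (y[lane] < 0.0f) r = -r;                    // lower half plane
        out[lane] = r;
    }
}

// Process n samples: full blocks of 4 take the fast path, the tail
// (fewer than 4 samples) uses the generic library atan2.
void atan2_block(const float* y, const float* x, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) atan2_x4(y + i, x + i, out + i);
    for (; i < n; ++i) out[i] = std::atan2(y[i], x[i]);
}
```

With only one sample per call, everything goes through the tail loop, which 
is why the gain is largest when the input length is a multiple of 4.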

As a side note:
I have also been working on a new version of the FFT FIR filter.
This one is more efficient when decimating, because the inverse FFT is smaller 
than the forward FFT:
inverse_FFT_size = forward_FFT_size / decimation
This works very well when decimation is 2^n. It also works well for most other 
decimation factors, EXCEPT when decimation is a big prime.

Since the forward FFT stays the same size and only the inverse FFT shrinks, the 
theoretical maximum speed improvement is a factor of two (when decimation is 
infinite).
But when you want multiple parts of the spectrum, the speed improvement is 
much better than using one FIR filter per spectrum part, because 
you can use a single forward FFT with multiple inverse FFTs.
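
A minimal sketch of the size relation, not the actual filter block: I am 
assuming here that the decimation is done by folding (aliasing) the 
forward spectrum into forward_FFT_size/decimation bins before the inverse 
transform, and that decimation divides the forward size. The naive dft() 
is only a stand-in for a real FFT library, and the names fold_spectrum and 
decimate_via_fft are made up. The multiplication by the filter's frequency 
response and the overlap handling between blocks are left out.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
using cvec = std::vector<cplx>;

// Naive O(N^2) DFT, standing in for a real FFT call.
// sign = -1: forward transform; sign = +1: inverse, scaled by 1/N.
cvec dft(const cvec& in, int sign) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double ang = sign * 2.0 * pi * double((k * t) % n) / double(n);
            acc += in[t] * std::polar(1.0, ang);
        }
        out[k] = (sign > 0) ? acc / double(n) : acc;
    }
    return out;
}

// Fold an N-point spectrum into N/decim bins by averaging the bins that
// alias onto each other. Assumes decim divides spec.size().
cvec fold_spectrum(const cvec& spec, std::size_t decim) {
    const std::size_t m = spec.size() / decim;
    cvec folded(m, cplx(0.0, 0.0));
    for (std::size_t k = 0; k < m; ++k) {
        for (std::size_t j = 0; j < decim; ++j) folded[k] += spec[k + j * m];
        folded[k] /= double(decim);
    }
    return folded;
}

// Forward transform of size N, inverse transform of size N/decim.
// The result equals every decim-th sample of the time-domain signal.
cvec decimate_via_fft(const cvec& x, std::size_t decim) {
    cvec spec = dft(x, -1);
    // A real filter would multiply spec by the taps' frequency response here.
    return dft(fold_spectrum(spec, decim), +1);
}
```

For several parts of the spectrum, the same forward result could be reused 
and only the fold plus the small inverse transform repeated per part.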

Greetings,
Martin

> We really need a faster sin/cos. Glad to hear you're working on it.
> 
> Tom

> 
> 




