On 01/17/2012 07:54 PM, Nick Foster wrote:
On Tue, Jan 17, 2012 at 10:36 AM, Josh Blum <address@hidden> wrote:
On 01/16/2012 09:51 AM, ziyang wrote:
> On 01/13/2012 09:30 PM, Josh Blum wrote:
>>> To reduce the computation load of the processor, I tried two methods:
>>> 1) modify the gr.quadrature_demod_cf block, replace some multiplication
>>> operations with volk-based operations (gr.multiply and gr.multiply_const
>>> modules in gr_blocks);
>> I like it. Make sure to contribute patches like that back. :-)
> Actually, what I did was writing a new quadrature_demod block without
> the multiplication and delay operations, and connect extra gr.multiply
> and gr.delay blocks instead in the flow graph. Because my understanding
> is that the volk functions take a vector (multiple values) as input, and
> I didn't figure out a way to do the single-item-operation in the volk
> style.
>
I don't recommend using the extra blocks; that would probably cause more
overhead. Looking at gr_quadrature_demod_cf::work, it looks like you can
vectorize the conjugate multiply, then the atan, then the gain scaling.
So, that would be one for loop that operates on 4 samples at a time and
calls 3 volk functions.
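
A rough sketch of that loop structure, in plain Python rather than C++ so
it is easy to experiment with. The function names and the chunk parameter
are made up for illustration, and each list comprehension only stands in
for where a volk call would go; no real volk function is called here.

```python
import cmath
import math


def quad_demod_reference(samples, gain):
    """Per-sample form: conjugate multiply, atan2, then gain, one output at a time."""
    out = []
    for i in range(1, len(samples)):
        product = samples[i] * samples[i - 1].conjugate()
        out.append(gain * math.atan2(product.imag, product.real))
    return out


def quad_demod_chunked(samples, gain, chunk=4):
    """Same result, computed as three passes over each small chunk.

    Each pass marks where one vector call would go (hypothetical layout).
    """
    out = []
    # Output i uses samples[i] and samples[i + 1], so there are len - 1 outputs.
    for start in range(0, len(samples) - 1, chunk):
        cur = samples[start + 1:start + 1 + chunk]
        prev = samples[start:start + len(cur)]
        # Pass 1: conjugate multiply of each sample with its predecessor.
        products = [c * p.conjugate() for c, p in zip(cur, prev)]
        # Pass 2: phase angle of every product.
        phases = [math.atan2(z.imag, z.real) for z in products]
        # Pass 3: gain scaling.
        out.extend(gain * ph for ph in phases)
    return out


def make_tone(n, phase_step):
    """Complex tone whose phase advances by phase_step radians per sample."""
    return [cmath.exp(1j * phase_step * k) for k in range(n)]
```

Feeding make_tone(64, 0.1) through either function with gain 2.0 should
give 63 outputs close to 0.2, which is a quick check that a rewritten work
function still demodulates correctly.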
Right now, the Volk atan2 function is only implemented for
SSE and only works if libsimdmath is installed. If not, it
will fall back to a generic implementation which is
considerably slower than Gnuradio's LUT atan2. There's no NEON
implementation, so right now the fastest option on E100 is to
use Gnuradio's built-in atan2.
I spent some quality time a couple of months ago during SDR
Forum writing a vectorized atan2 algorithm in Volk via Orc. I
was unable to get the entire algorithm to fit within the
register constraints the Orc runtime compiler applies. The end
goal is to get the entire algorithm vectorized so it only
needs to write out to memory once, which is going to be far
faster than running three vector operations across a large
buffer which won't fit into cache. I'll get back to it one of
these days but it looks like parts of Orc's compiler will have
to be improved. Terry, Orc code is easily read and looks like
vector pseudocode, so my Orc implementation might be of use if
you're interested in writing a custom NEON implementation for
Volk. It's based on the libsimdmath implementation, which is in
turn based on Cephes, and uses all sorts of Crazy Math Tricks.
--n
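
For a feel of what a branch-light atan2 approximation looks like, here is a
minimal scalar sketch in Python. It is NOT the libsimdmath/Cephes algorithm
mentioned above and not GNU Radio's LUT atan2; it swaps in a much simpler
technique, a well-known low-order polynomial for atan on [-1, 1] combined
with range reduction by quadrant. The function names are made up, and the
error figure (roughly 0.004 rad) is only approximate.

```python
import math


def atan_unit(x):
    """Approximate atan(x) for -1 <= x <= 1 with a low-order polynomial."""
    return x * (math.pi / 4 + 0.273 * (1.0 - abs(x)))


def atan2_approx(y, x):
    """Approximate atan2(y, x) by reducing to atan of a ratio in [-1, 1]."""
    if x == 0.0 and y == 0.0:
        return 0.0
    if abs(y) <= abs(x):
        # Within 45 degrees of the x axis: use y/x directly.
        angle = atan_unit(y / x)
        if x < 0.0:
            # Left half-plane: shift by pi toward the sign of y.
            angle += math.pi if y >= 0.0 else -math.pi
        return angle
    # Within 45 degrees of the y axis: use atan(y/x) = +/-pi/2 - atan(x/y).
    return math.copysign(math.pi / 2, y) - atan_unit(x / y)
```

Whether that accuracy is good enough depends on the demodulator; comparing
against math.atan2 over a sweep of angles is an easy way to find out.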
Thank you for your help, Nick. Right now, I really want a faster
atan implementation, but I use Python and occasionally C++ most of
the time, so I'm not sure I can handle a custom NEON implementation:
Orc, NEON, libsimdmath and Cephes are all completely new to me.
Thanks.
Best Regards,
Terry
>> Also, you may consider timing a particular operation as a performance
>> metric, rather than counting the number of demodulated packets.
>>
> I was wondering if there are examples from which I can learn how to do
> this?
Sorry, I guess there isn't much in the way of examples.
You can time individual work functions by adding some code before and
after. We have some high resolution timers in
gruel/include/gruel/high_res_timers.h
I have also seen people time the block in a simple flow graph with a
null source, head, your_block, null_sink. You can time tb.run() and
compare run duration vs the non-vectorized code.
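
A minimal sketch of the "time the run and compare" idea, using only the
Python standard library. The helper names time_call and speedup are made
up for illustration, and time.perf_counter assumes Python 3.3 or newer (on
older Pythons time.time would be the nearest substitute).

```python
import time


def time_call(func, repeats=5):
    """Return the best wall-clock duration, in seconds, of func() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

In the flow graph case, func would presumably be a small function that
builds a fresh top block (null source, head, your_block, null sink) and
calls its run() method, once with the original block and once with the
vectorized one. Taking the best of several runs reduces noise from other
processes on the machine.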
-Josh
_______________________________________________
Discuss-gnuradio mailing list
address@hidden
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio