[Top][All Lists]

## Re: [Discuss-gnuradio] VOLK division between complexes

 From: Federico Larroca Subject: Re: [Discuss-gnuradio] VOLK division between complexes Date: Wed, 11 May 2016 18:56:04 -0300

Marcus (Leech), I found the function you mentioned minutes after I sent the mail. Although it apparently works, Performance Monitor is behaving really weird when I use it. I have to look up that.
Marcus (Müller), a very informative answer indeed. I will see if I can get that endless fame you mention :-).
In any case, I'll post what I finally did and the performance gain achieved.
Best
Federico

2016-05-11 17:47 GMT-03:00 Marcus Müller :
Hi Federico,

On 11.05.2016 21:09, Federico Larroca wrote:
Hello everyone,
We are on the stage of optimizing our project (gr-isdbt).
Awesome!
One of the most consuming blocks is OFDM synchronization, and in particular the equalization phase. This is simply the division between the input signal and the estimated channel gains (two modestly big arrays of ~5000 complexes for each OFDM symbol).
Until now, this was performed by a for loop, so my plan was to change it for a volk function. However, there is no complex division in VOLK. So I've done a rather indirect operation using the property that a/b = a*conj(b)/|b|^2, resulting in six lines of code (a multiply conjugate, a magnitude squared, a deinterleave, a couple of float divisions and an interleave). Obviously the performance gain (measured with the Performance Monitor) is marginal (to be optimistic)...
I have to admit, I'd expect your "simple" for loop doing something like

void yourclass::normalize(std::complex<float> *a, std::complex<float> *b) {
for(size_t idx; idx < a_len; ++idx)
a[idx] /= b[idx];
}

to be neatly optimizable by the compiler, at least if it knows that a and b aren't pointing at the same memory-

is correct; however, in C++ with std::complex<>
a/b
pretty much does that already (ugly std lib C++ ahead, from /usr/include/c++/<version>/complex):
  // XXX: This is a grammar school implementation.
template<typename _Tp>
template<typename _Up>
complex<_Tp>&
complex<_Tp>::operator/=(const complex<_Up>& __z)
{
const _Tp __r =  _M_real * __z.real() + _M_imag * __z.imag();
const _Tp __n = std::norm(__z);
_M_imag = (_M_imag * __z.real() - _M_real * __z.imag()) / __n;
_M_real = __r / __n;
return *this;
}

And the problem is that while doing that for every a and b separately might mean you can't make full use of SIMD instructions to eg. do four complex divisions at once, it avoids having to load and store original / intermediate values from/to RAM. Basically, your CPU might not be the bottleneck – RAM could be, and doing everything you need for a single division at once, even if done without any optimization, might be faster than incurring additional memory transfers. That's because your memory controller pre-fetches whole cache lines worth of values when getting the first elements of a and b, and working on values from cache is significantly (read: factor > 50) than a single memory transfer.

So, my immediate recommendation really is to keep your loop as minimal as possible, giving your compiler a solid chance to see the potential for optimization. There might not be much you can do. Even hand-written VOLK kernels aren't always faster than automatically generated optimized machine code.
Does anyone has a better idea? Implementing a new kernel is simply out of my knowledge scope.
Ha! But it would mean endless (additional) fame!
Soooo: look at the volk_32fc_x2_multiply_conjugate_32fc.h kernel source. Specifically, at the SSE3 implementation, volk_32fc_x2_multiply_conjugate_32fc_u_sse3(…).
You'll notice line 134:
     z = _mm_complexconjugatemul_ps(x, y);


As you can see, there's a a "VOLK intrinsic",
_mm_complexconjugatemul_ps

which is defined in volk_intrinsics.h. That same file contains _mm_magnitudesquared_ps_sse3 . Maybe you can make something clever out of that :)

Best regards,
Marcus

[1] https://gcc.gnu.org/onlinedocs/gcc/Restricted-Pointers.html

_______________________________________________