[Top][All Lists]

## Re: [Discuss-gnuradio] VOLK division between complexes

 From: Federico Larroca Subject: Re: [Discuss-gnuradio] VOLK division between complexes Date: Sat, 14 May 2016 21:40:23 -0300

That was fast! Thank you very much! I don't have access to my computer for the weekend, but I'll check it as soon as I get back to the University on tuesday (monday's holiday here).
In any case, I got to halfway implementing the AVX kernel, which I copy below just for the record... I didn't even got to compile it, let alone test it, but I surely learned a lot.
Best
Federico

static inline void
volk_32fc_x2_divide_32fc_u_avx(lv_32fc_t* cVector, const lv_32fc_t* aVector,
const lv_32fc_t* bVector, unsigned int num_points)
{
unsigned int number = 0;
const unsigned int quarterPoints = num_points / 4;

__m256 x, y, z, sq, mag_sq, mag_sq_un, div;
lv_32fc_t* c = cVector;
const lv_32fc_t* a = aVector;
const lv_32fc_t* b = bVector;

for(; number < quarterPoints; number++){
x = _mm256_loadu_ps((float*) a); // Load the ar + ai, br + bi ... as ar,ai,br,bi ...
y = _mm256_loadu_ps((float*) b); // Load the cr + ci, dr + di ... as cr,ci,dr,di ...
z = _mm256_complexconjugatemul_ps(x, y);
sq = _mm256_mul_ps(y, y); // Square the values
mag_sq_un = _mm256_hadd_ps(w,w); // obtain the actual squared magnitude, although out of order
mag_sq = _mm256_permute_ps(mag_sq_un, 0xd8) // I order it
div = _mm256_div_ps(z,mag_sq);

_mm256_storeu_ps((float*) c, div); // Store the results back into the C container

a += 4;
b += 4;
c += 4;
}

(I got this far ).

2016-05-14 14:47 GMT-03:00 Marcus Müller :
Hi Federico,

could you have a look at this branch [1] and tell me whether it works for you? It's SSE3 and AVX only, so far.
Basically, you could

git pull https://github.com/marcusmueller/volk complex_division
cd ../build
make
make install

and use volk_32fc_x2_divide_32fc in your code.

Best regards,
Marcus

[1] https://github.com/marcusmueller/volk/tree/complex_division

On 11.05.2016 23:56, Federico Larroca wrote:
Marcus (Leech), I found the function you mentioned minutes after I sent the mail. Although it apparently works, Performance Monitor is behaving really weird when I use it. I have to look up that.
Marcus (Müller), a very informative answer indeed. I will see if I can get that endless fame you mention :-).
In any case, I'll post what I finally did and the performance gain achieved.
Best
Federico

2016-05-11 17:47 GMT-03:00 Marcus Müller :
Hi Federico,

On 11.05.2016 21:09, Federico Larroca wrote:
Hello everyone,
We are on the stage of optimizing our project (gr-isdbt).
Awesome!
One of the most consuming blocks is OFDM synchronization, and in particular the equalization phase. This is simply the division between the input signal and the estimated channel gains (two modestly big arrays of ~5000 complexes for each OFDM symbol).
Until now, this was performed by a for loop, so my plan was to change it for a volk function. However, there is no complex division in VOLK. So I've done a rather indirect operation using the property that a/b = a*conj(b)/|b|^2, resulting in six lines of code (a multiply conjugate, a magnitude squared, a deinterleave, a couple of float divisions and an interleave). Obviously the performance gain (measured with the Performance Monitor) is marginal (to be optimistic)...
I have to admit, I'd expect your "simple" for loop doing something like

void yourclass::normalize(std::complex<float> *a, std::complex<float> *b) {
for(size_t idx; idx < a_len; ++idx)
a[idx] /= b[idx];
}

to be neatly optimizable by the compiler, at least if it knows that a and b aren't pointing at the same memory-

is correct; however, in C++ with std::complex<>
a/b
pretty much does that already (ugly std lib C++ ahead, from /usr/include/c++/<version>/complex):
  // XXX: This is a grammar school implementation.
template<typename _Tp>
template<typename _Up>
complex<_Tp>&
complex<_Tp>::operator/=(const complex<_Up>& __z)
{
const _Tp __r =  _M_real * __z.real() + _M_imag * __z.imag();
const _Tp __n = std::norm(__z);
_M_imag = (_M_imag * __z.real() - _M_real * __z.imag()) / __n;
_M_real = __r / __n;
return *this;
}

And the problem is that while doing that for every a and b separately might mean you can't make full use of SIMD instructions to eg. do four complex divisions at once, it avoids having to load and store original / intermediate values from/to RAM. Basically, your CPU might not be the bottleneck – RAM could be, and doing everything you need for a single division at once, even if done without any optimization, might be faster than incurring additional memory transfers. That's because your memory controller pre-fetches whole cache lines worth of values when getting the first elements of a and b, and working on values from cache is significantly (read: factor > 50) than a single memory transfer.

So, my immediate recommendation really is to keep your loop as minimal as possible, giving your compiler a solid chance to see the potential for optimization. There might not be much you can do. Even hand-written VOLK kernels aren't always faster than automatically generated optimized machine code.
Does anyone has a better idea? Implementing a new kernel is simply out of my knowledge scope.
Ha! But it would mean endless (additional) fame!
Soooo: look at the volk_32fc_x2_multiply_conjugate_32fc.h kernel source. Specifically, at the SSE3 implementation, volk_32fc_x2_multiply_conjugate_32fc_u_sse3(…).
You'll notice line 134:
     z = _mm_complexconjugatemul_ps(x, y);


As you can see, there's a a "VOLK intrinsic",
_mm_complexconjugatemul_ps

which is defined in volk_intrinsics.h. That same file contains _mm_magnitudesquared_ps_sse3 . Maybe you can make something clever out of that :)

Best regards,
Marcus

[1] https://gcc.gnu.org/onlinedocs/gcc/Restricted-Pointers.html

_______________________________________________