Re: [Discuss-gnuradio] VOLK division between complexes
From: Marcus Müller
Date: Sun, 15 May 2016 11:06:56 +0200
Hi Federico
On 15.05.2016 02:40, Federico Larroca wrote:
> That was fast!
Only ten times as fast as the generic, pure C implementation, but thank
you :)
> Thank you very much!
You're welcome :)
> I don't have access to my computer for the weekend, but I'll check it
> as soon as I get back to the University on Tuesday (Monday's a holiday
> here).
> In any case, I got halfway through implementing the AVX kernel, which I
> copy below just for the record... I didn't even get to compile it, let
> alone test it, but I surely learned a lot.
Yeah, it was my first kernel, too :) Learned a lot!
> static inline void
> volk_32fc_x2_divide_32fc_u_avx(lv_32fc_t* cVector, const lv_32fc_t*
> aVector,
> const lv_32fc_t* bVector,
> unsigned int num_points)
> {
> unsigned int number = 0;
> const unsigned int quarterPoints = num_points / 4;
>
> __m256 x, y, z, sq, mag_sq, mag_sq_un, div;
> lv_32fc_t* c = cVector;
> const lv_32fc_t* a = aVector;
> const lv_32fc_t* b = bVector;
>
> for(; number < quarterPoints; number++){
> x = _mm256_loadu_ps((float*) a); // Load the ar + ai, br + bi ... as ar,ai,br,bi ...
> y = _mm256_loadu_ps((float*) b); // Load the cr + ci, dr + di ... as cr,ci,dr,di ...
> z = _mm256_complexconjugatemul_ps(x, y);
> sq = _mm256_mul_ps(y, y); // Square the values
> mag_sq_un = _mm256_hadd_ps(w,w); // obtain the actual squared magnitude, although out of order
you mean ... _hadd_ps(sq,sq), right?
> mag_sq = _mm256_permute_ps(mag_sq_un, 0xd8); // reorder it
ah, clever move! Very clever indeed!
What you do is get four complex values at once, then calculate a b*,
then calculate (hadd works within each 128-bit lane, so the order is)
|b0|² |b1|² |b0|² |b1|² |b2|² |b3|² |b2|² |b3|²
and then reorder it in the register to be
|b0|² |b0|² |b1|² |b1|² |b2|² |b2|² |b3|² |b3|²
right? (I still haven't gotten around to learning to read the
shuffle/permute masks, and I'm a bit too lazy to do so right now.)
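One way to check the mask without decoding it by eye is a scalar model of the two intrinsics, run on plain arrays. The sketch below is only a model, not VOLK code: the names hadd_ps_model, permute_ps_model and mag_sq_duplicated are made up for illustration. It assumes the documented behaviour of _mm256_hadd_ps (pairwise sums within each 128-bit lane) and _mm256_permute_ps (two selector bits per output element, the same selector applied to both lanes). With 0xd8 the selectors are 0, 2, 1, 3, so each lane goes from x0 x1 x2 x3 to x0 x2 x1 x3.

```c
#include <stddef.h>

/* Scalar model of _mm256_hadd_ps(a, b): sums adjacent pairs,
 * separately inside each 128-bit lane (elements 0-3 and 4-7). */
static void hadd_ps_model(const float a[8], const float b[8], float out[8])
{
    for (size_t lane = 0; lane < 8; lane += 4) {
        out[lane + 0] = a[lane + 0] + a[lane + 1];
        out[lane + 1] = a[lane + 2] + a[lane + 3];
        out[lane + 2] = b[lane + 0] + b[lane + 1];
        out[lane + 3] = b[lane + 2] + b[lane + 3];
    }
}

/* Scalar model of _mm256_permute_ps(a, imm8): output element i of each
 * lane is picked by bits [2i+1:2i] of imm8; both lanes use the same imm8.
 * out must not alias a. */
static void permute_ps_model(const float a[8], unsigned imm8, float out[8])
{
    for (size_t lane = 0; lane < 8; lane += 4)
        for (size_t i = 0; i < 4; i++)
            out[lane + i] = a[lane + ((imm8 >> (2 * i)) & 3u)];
}

/* Hypothetical helper: b holds four interleaved complex values
 * (re0, im0, re1, im1, ...); out receives each |b_k|^2 twice, in order. */
static void mag_sq_duplicated(const float b[8], float out[8])
{
    float sq[8], summed[8];
    for (size_t i = 0; i < 8; i++)
        sq[i] = b[i] * b[i];              /* models _mm256_mul_ps(y, y) */
    hadd_ps_model(sq, sq, summed);        /* |b0|² |b1|² |b0|² |b1|² | |b2|² |b3|² |b2|² |b3|² */
    permute_ps_model(summed, 0xd8, out);  /* |b0|² |b0|² |b1|² |b1|² | |b2|² |b2|² |b3|² |b3|² */
}
```

Feeding it b = {1, 2, 3, 4, 5, 6, 7, 8} gives squared magnitudes 5, 25, 61 and 113, each appearing twice in a row, which is the layout the element-wise division needs.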
> div = _mm256_div_ps(z,mag_sq);
>
> _mm256_storeu_ps((float*) c, div); // Store the results back into
> the C container
>
> a += 4;
> b += 4;
> c += 4;
> }
>
> (I got this far).
Looks pretty solid to me!
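For comparing results, and for the points the loop above leaves over (it only covers num_points / 4 * 4 of them), a plain scalar version of the same arithmetic is handy. This is a minimal sketch, not the generic kernel from VOLK: the name divide_interleaved_ref is hypothetical, and it takes the complex vectors as interleaved floats (re, im, re, im, ...), the same view the (float*) casts in the AVX code use. It computes c = a b* / |b|² per point.

```c
/* Scalar reference for complex division on interleaved (re, im) floats.
 * For each point: c = a * conj(b) / |b|^2.
 * num_points counts complex values, so each array holds 2 * num_points floats. */
static void divide_interleaved_ref(float *c, const float *a, const float *b,
                                   unsigned int num_points)
{
    for (unsigned int i = 0; i < num_points; i++) {
        const float ar = a[2 * i], ai = a[2 * i + 1];
        const float br = b[2 * i], bi = b[2 * i + 1];
        const float mag_sq = br * br + bi * bi;      /* |b|^2 */
        c[2 * i]     = (ar * br + ai * bi) / mag_sq; /* Re(a * conj(b)) / |b|^2 */
        c[2 * i + 1] = (ai * br - ar * bi) / mag_sq; /* Im(a * conj(b)) / |b|^2 */
    }
}
```

For example, (1 + 2i) / (3 + 4i) comes out as (11 + 2i) / 25, and 2 / i as -2i. Called on the pointers a, b, c after the vector loop with num_points % 4 as the count, the same function would handle the tail.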
So the difference between my AVX kernel and yours is that mine loads
a total of eight a,b complexes at once, basically because the
_mm256_mul_ps/_mm256_hadd_ps step can produce eight |b|² at once – and then I
really struggled (but managed) to have each of these |b|² twice, so I
can do the two _mm256_div_ps. Your approach is so much cleverer, because it
uses fewer registers and less obscure shuffling.
My AVX kernel, on my machine, is about as fast as my SSE3 kernel. So I'd
really like to ask you to try mine, and then just replace my AVX code
with yours, and compare the results. I think yours might be
significantly faster!
Best regards,
Marcus