Re: [Discuss-gnuradio] VOLK division between complexes

From: Federico Larroca
Subject: Re: [Discuss-gnuradio] VOLK division between complexes
Date: Tue, 17 May 2016 15:33:01 -0300

Hello again,

Thank you Marcus for looking through my code (and for the positive comments).
I have several things to report:
 - Pulling from your repo and using volk_32fc_x2_divide_32fc worked perfectly, and gr-isdbt kept operating as usual.
 - Substituting my AVX proto-kernel (plus the aligned version) for yours works too.
 - Regarding performance, they are almost indistinguishable, at least on my PC: running volk_profile, both are roughly 10x faster than the generic implementation, and Performance Monitor shows no real difference between the two implementations (here I used the complete receiving chain).
 - The performance gain (again, measured with Performance Monitor over the complete receiving chain) of the divide kernel over the four lines of code I sent before is essentially nil.

So, to conclude: the performance gain of using a kernel instead of those four lines of code may not be significant, but I believe the kernel is both simpler to use and easier to read (and performance should be tested in other setups before drawing a firm conclusion on the former). As for the implementations, the two AVX kernels are, from my inexperienced perspective, mostly identical (maybe mine is a little simpler to read, but since I wrote it, I'm no judge). I've seen you already made a pull request to add it to gnuradio, so my suggestion is to use yours (but feel free to use/test mine if you prefer).

In any case, this was a very interesting and formative experience.

2016-05-15 6:06 GMT-03:00 Marcus Müller <address@hidden>:
Hi Federico

On 15.05.2016 02:40, Federico Larroca wrote:
> That was fast!
Only ten times as fast as the generic, pure C implementation, but thank
you :)
> Thank you very much!
You're welcome :)
> I don't have access to my computer for the weekend, but I'll check it
> as soon as I get back to the University on tuesday (monday's holiday
> here).
> In any case, I got halfway through implementing the AVX kernel, which I
> copy below just for the record... I didn't even get to compile it, let
> alone test it, but I surely learned a lot.
Yeah, it was my first kernel, too :) Learned a lot!
> static inline void
> volk_32fc_x2_divide_32fc_u_avx(lv_32fc_t* cVector, const lv_32fc_t* aVector,
>                                const lv_32fc_t* bVector, unsigned int num_points)
> {
>   unsigned int number = 0;
>   const unsigned int quarterPoints = num_points / 4;
>   __m256 x, y, z, sq, mag_sq, mag_sq_un, div;
>   lv_32fc_t* c = cVector;
>   const lv_32fc_t* a = aVector;
>   const lv_32fc_t* b = bVector;
>   for(; number < quarterPoints; number++){
>     x = _mm256_loadu_ps((float*) a); // Load ar + ai, br + bi, ... as ar,ai,br,bi,...
>     y = _mm256_loadu_ps((float*) b); // Load cr + ci, dr + di, ... as cr,ci,dr,di,...
>     z = _mm256_complexconjugatemul_ps(x, y);
>     sq = _mm256_mul_ps(y, y); // Square the values
>     mag_sq_un = _mm256_hadd_ps(w,w); // obtain the actual squared magnitude, although out of order
you mean ... _hadd_ps(sq,sq), right?
>     mag_sq = _mm256_permute_ps(mag_sq_un, 0xd8); // I order it
ah, clever move! Very clever indeed!
What you do is get four complex values at once, then calculate a·b*, then
calculate
|b0|² |b1|² |b2|² |b3|² |b0|² |b1|² |b2|² |b3|²
and then reorder it in memory to be
|b0|² |b0|² |b1|² |b1|² |b2|² |b2|² |b3|² |b3|²
right? (I still haven't gotten around to being able to read the
shuffle/permute masks, and I'm a bit too lazy to do so now.)

>     div = _mm256_div_ps(z,mag_sq);
>     _mm256_storeu_ps((float*) c, div); // Store the results back into the C container
>     a += 4;
>     b += 4;
>     c += 4;
>   }
> (I got this far.)
Looks pretty solid to me!

So the difference between my AVX kernel and yours is that mine loads a
total of eight a,b complexes at once, basically because the
_mm256_mul/_mm256_hadd step can produce eight |b|² at once – and then I
really struggled (but managed) to get each of these |b|² twice, so I
could do the two _mm256_div. Your approach is so much cleverer, because it
uses fewer registers and less obscure shuffling.

My AVX kernel, on my machine, is about as fast as my SSE3 kernel. So I'd
really like to ask you to try mine, and then just replace my AVX code
with yours, and compare the results. I think yours might be
significantly faster!

Best regards,
