
Re: [lmi] Using auto-vectorization


From: Greg Chicares
Subject: Re: [lmi] Using auto-vectorization
Date: Sat, 21 Jan 2017 11:21:12 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.6.0

On 2017-01-21 01:20, Vadim Zeitlin wrote:
> On Fri, 20 Jan 2017 23:47:14 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> Would you like to propose a patch to 'expression_template_0_test.cpp'

I should also mention 'vector_test.cpp', which implements a tiny ET class
and compares its performance to std::valarray and hand-written C. It is
"of historical interest only" according to a comment, but comments aren't
always accurate.
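
For anyone reading the archives later, by "a tiny ET class" I mean
something like the following sketch (the names and details here are
invented for illustration, not copied from 'vector_test.cpp'):

    #include <cstddef>
    #include <vector>

    // Lazy node for elementwise l[i] + r[i]: combining expressions
    // materializes no temporary vector.
    template<typename L, typename R>
    struct sum_expr
    {
        L const& l;
        R const& r;
        double operator[](std::size_t i) const {return l[i] + r[i];}
        std::size_t size() const {return l.size();}
    };

    // A real library would constrain the operands; an unconstrained
    // operator+ is far too greedy for anything but a sketch.
    template<typename L, typename R>
    sum_expr<L,R> operator+(L const& l, R const& r) {return {l, r};}

    class et_vector
    {
      public:
        explicit et_vector(std::size_t n) : d_(n) {}
        double  operator[](std::size_t i) const {return d_[i];}
        double& operator[](std::size_t i)       {return d_[i];}
        std::size_t size() const {return d_.size();}
        // Assigning any expression evaluates it in a single pass.
        template<typename E>
        et_vector& operator=(E const& e)
        {
            for(std::size_t i = 0; i < size(); ++i) {d_[i] = e[i];}
            return *this;
        }
      private:
        std::vector<double> d_;
    };

With that in place, 'v2 = v0 + v1;' compiles down to a single loop
over 'v0[i] + v1[i]' with no intermediate vector, which is exactly
the property these benchmarks exercise.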

> GC> so that we can measure what you'd prefer

I'm guessing that you'll prefer lambdas. If you like, you could just give
me your implementation parallel to mete_c() in each of those unit tests,
and I'll add it. Or would your preferred implementation simply replace
the classic for-loops with range-based ones? If so, could you show me how
you'd write mete_cxx11_typical() to parallel the other mete_*_typical()
functions, if that's a better test of what you have in mind?
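
To be concrete, here's the sort of contrast I mean (the arithmetic
below is invented for illustration, not copied from the actual
mete_*() functions):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Classic C-style loop, in the spirit of mete_c():
    void mete_c_typical_sketch
        (std::vector<double> const& v0
        ,std::vector<double> const& v1
        ,std::vector<double>      & v2
        )
    {
        for(std::size_t i = 0; i < v2.size(); ++i)
            {
            v2[i] = v0[i] + 2.1 * v1[i];
            }
    }

    // One possible C++11 rendering: std::transform with a lambda.
    void mete_cxx11_typical_sketch
        (std::vector<double> const& v0
        ,std::vector<double> const& v1
        ,std::vector<double>      & v2
        )
    {
        std::transform
            (v0.begin(), v0.end()
            ,v1.begin()
            ,v2.begin()
            ,[](double x, double y) {return x + 2.1 * y;}
            );
    }

A plain range-based for doesn't naturally traverse two vectors in
parallel, which is one reason I'd guess that lambdas with <algorithm>
are the more natural C++11 idiom here.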

> GC> against the other methods
> GC> already tested there? It tests array lengths of 10^{0, 1, 2, 3, 4, 5}.
> GC> It would be extremely interesting to see whether auto-vectorization
> GC> has obviated the need for expression templates.
> 
>  At first glance, it doesn't seem so. I've enabled auto-vectorization
> for gcc6 by using -O3 (note that it's disabled by default, because we
> use -O2, which doesn't include -ftree-vectorize). As soon as it kicks
> in, which happens for N=100, it yields significant gains (though
> smaller than I had expected; maybe I was just unreasonably optimistic)
> for the C, valarray and PETE versions, and smaller gains for the STL
> and μBLAS versions, so the former still remain faster.

Fascinating. BTW, due to problem-domain considerations, typical lmi
vectors are of size fifty or so, and probably always in [10, 100].
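
In case anyone wants to reproduce this from the archives, here's a
minimal illustration of the flags involved (the file and function
names are invented):

    // vec_demo.cpp; compile with:
    //   g++ -O2 -c vec_demo.cpp
    //     loop not vectorized: -O2 lacks -ftree-vectorize
    //   g++ -O3 -fopt-info-vec-optimized -c vec_demo.cpp
    //     gcc reports the vectorized loop: -O3 implies
    //     -ftree-vectorize
    void scale_and_add(double* a, double const* b, int n)
    {
        for(int i = 0; i < n; ++i)
            {
            a[i] += 2.1 * b[i];
            }
    }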

>  To give some numbers: for N=1000, the C and PETE versions speed up by
> 48% (438ns with -O3 against 846ns with -O2), and valarray by almost
> 50%; however, the difference between them is so small that it's within
> the measurement error interval, and they are all roughly equivalent.
> The plain STL version gains only 25%, meaning that it falls even
> further behind the fastest code: with -O2 plain STL is ~2.35 times
> slower, while with -O3 it's ~3.4 times slower. Fancy STL time is
> reduced by 35% with vectorization, but the end result is the same kind
> of gap: instead of being 1.3 times slower, it's 1.6 times slower with
> vectorization.
> 
>  So, if anything, using STL is even worse with auto-vectorization. But
> the excellent news is that the compiler manages to auto-vectorize PETE
> code just as well as manual loops. And while it could be measurement
> error again, PETE somehow consistently manages to be faster than the C
> version, although the effect shrinks as N increases:
> 
>       N       PETE time in terms of C
>       -------------------------------
>           1    80%
>          10    83%
>         100    89%
>        1000    97%
>       10000   101%
> 
> I probably could spend more time looking at this, notably trying to
> understand what exactly -fopt-info-vec-missed is telling me...

If those differences actually aren't measurement error, then PETE
might be about fifteen percent faster than hand-coded C. But I'm
a little suspicious, because the largest difference is for N=1,
where nothing should beat C: the loop body executes exactly once,
so there's no loop overhead for expression templates to eliminate.
Maybe we need to make the test functions perform more arithmetic
than they do at present.
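
For instance, something along these lines (a made-up kernel, just to
give each iteration more to do than one multiply-add):

    // Hypothetical heavier kernel: several floating-point operations
    // per element, so that arithmetic, not loop overhead, dominates
    // even for small N.
    void mete_c_heavy(double const* v0, double const* v1, double* v2, int n)
    {
        for(int i = 0; i < n; ++i)
            {
            double const x = v0[i];
            double const y = v1[i];
            v2[i] = x * y + 2.1 * (x - y) + x * x * y - 0.5 * y * y;
            }
    }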



