Re: [lmi] Using auto-vectorization
From: Greg Chicares
Subject: Re: [lmi] Using auto-vectorization
Date: Wed, 25 Jan 2017 11:07:51 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.6.0
On 2017-01-24 17:14, Vadim Zeitlin wrote:
> On Tue, 24 Jan 2017 04:11:10 +0000 Greg Chicares <address@hidden> wrote:
>
> GC> On 2017-01-24 02:49, Vadim Zeitlin wrote:
> GC> [...]
> GC> > ET-based code seems to profit from auto-vectorization just
> GC> > as well as everything else, so I don't see any reason to use
> GC> > anything else, especially if code clarity and simplicity are
> GC> > the most important criteria.
> GC> >
> GC> > Now, whether using the particular PETE library is the best
> GC> > choice in 2017 is another question, and I suspect that it
> GC> > isn't, but I'm not aware of any critical problems with it
> GC> > either.
> GC>
> GC> It seems that there was a flurry of interest around the turn of the
> GC> century, but almost none since then. The audience for ET libraries is
> GC> relatively small, and I'd guess that most potential users chose a
> GC> library long ago and aren't interested in changing.
>
> There are still a few actively developed libraries built on ET, e.g. Eigen
> (http://eigen.tuxfamily.org/) or Armadillo (http://arma.sourceforge.net/)
Both seem to be MPL2:
https://www.gnu.org/licenses/license-list.en.html#GPLCompatibleLicenses
Maybe we should take a look at them someday.
> and, of course, some older libraries such as Boost.uBLAS are still much
> newer than PETE.
Our universe is vectors of length one to one hundred, the most typical
length being about fifty; in that range, uBLAS is slow:
Running expression_template_0_test:

  Speed tests: array length 1
    C        : 3.109e-007 s = 311 ns, mean of 32169 iterations
    valarray : 1.686e-007 s = 169 ns, mean of 59310 iterations
    uBLAS    : 3.273e-007 s = 327 ns, mean of 30553 iterations
    PETE     : 1.680e-007 s = 168 ns, mean of 59539 iterations

  Speed tests: array length 10
    C        : 1.764e-007 s = 176 ns, mean of 56685 iterations
    valarray : 1.782e-007 s = 178 ns, mean of 56134 iterations
    uBLAS    : 3.407e-007 s = 341 ns, mean of 29357 iterations
    PETE     : 1.779e-007 s = 178 ns, mean of 56222 iterations

  Speed tests: array length 100
    C        : 2.786e-007 s = 279 ns, mean of 35892 iterations
    valarray : 2.799e-007 s = 280 ns, mean of 35728 iterations
    uBLAS    : 4.564e-007 s = 456 ns, mean of 21912 iterations
    PETE     : 2.776e-007 s = 278 ns, mean of 36027 iterations
uBLAS is about half the speed of valarray or PETE; it's even slower
than the "STL fancy" test:
/// v2 += v0 - 2.1 * v1;
std::transform
(sv0b.begin(),sv0b.end(),sv1b.begin(),tmp0.begin()
,std::bind
(std::minus<double>()
,std::placeholders::_1
,std::bind(std::multiplies<double>(),std::placeholders::_2,2.1) ) );
std::transform
(sv2b.begin(),sv2b.end(),tmp0.begin(),sv2b.begin()
,std::plus<double>() );
For N=10000, uBLAS beats "STL fancy", but not by much. It might be
great for linear algebra, but for our purposes it's not suitable.
> GC> > So, I guess, I'm still not sure what, if anything, should be
> GC> > done here? I can spend a lot of time profiling/benchmarking/
> GC> > debugging and it probably will result in at least some useful
> GC> > insights, but I can't propose any syntax better than the
> GC> > current ET-based one, and so I'm still not sure what my goal
> GC> > is here.
> GC>
> GC> I think we're done for now. We aren't likely to find anything that
> GC> outperforms PETE. We can make greater use of it as time permits.
>
> Yes, I agree with this. However I think that you might still want to
> consider switching to -O3 (or adding just -ftree-vectorize?) as it seems to
> result in a "free" performance gain.
Maybe I should add a makefile target for the special purpose of testing
lmi's overall speed. Until then, this is probably a good test: run a
single census with 184 cells. Here, I used '--emit=emit_nothing' to
emphasize calculations, which are more likely to be helped by '-O3'
than report generation. The first set of timings uses the '-O2' binary
we would distribute today; the second set uses '-O3' instead, but
otherwise all flags are the same.
/opt/lmi/src/lmi[0]$time wine
/opt/lmi/src/build/lmi/Linux/gcc/ship/lmi_cli_shared.exe
--file=/opt/lmi/test/sample.cns --accept --ash_nazg --data_path=/opt/lmi/data
--emit=emit_nothing >/dev/null
"-O2"
12.36s user 0.30s system 96% cpu 13.059 total
11.72s user 0.34s system 96% cpu 12.447 total
12.35s user 0.30s system 96% cpu 13.050 total
/opt/lmi/src/lmi[0]$time wine
/opt/lmi/src/build/lmi/Linux/gcc/fastest/lmi_cli_shared.exe
--file=/opt/lmi/test/sample.cns --accept --ash_nazg --data_path=/opt/lmi/data
--emit=emit_nothing >/dev/null
"-O3"
11.60s user 0.30s system 96% cpu 12.264 total
11.53s user 0.31s system 96% cpu 12.231 total
12.32s user 0.30s system 96% cpu 13.026 total
It doesn't seem to make a large difference. Perhaps it did for you
with 64-bit builds, or at least with SSE?
I tried increasing the priority, but...
/opt/lmi/src/lmi[127]$time sudo nice --10 wine
/opt/lmi/src/build/lmi/Linux/gcc/fastest/lmi_cli_shared.exe
--file=/opt/lmi/test/sample.cns --accept --ash_nazg --data_path=/opt/lmi/data
--emit=emit_nothing >/dev/null
wine: created the configuration directory '/root/.wine'
No protocol specified
Application tried to create a window, but no driver could be loaded.
Make sure that your X server is running and that $DISPLAY is set correctly.
/opt/lmi/src/lmi[53]$sudo rm -rf /root/.wine
...apparently 'wine' needs to create a hidden window.
Repeating the tests as above, but a few hours later and with a
slightly different phase of the moon:
"-O2"
12.22s user 0.31s system 97% cpu 12.909 total
11.98s user 0.32s system 96% cpu 12.683 total
12.14s user 0.30s system 96% cpu 12.872 total
"-O3"
12.32s user 0.31s system 97% cpu 13.018 total
11.97s user 0.34s system 96% cpu 12.712 total
12.22s user 0.31s system 97% cpu 12.913 total
Now, comparing the "total" vectors element by element,
"-O2" is faster than "-O3" in each of the three pairs,
but the differences are not significant.