[lmi] gcc -flto [Was: libstdc++ anomaly?]
From: Greg Chicares
Subject: [lmi] gcc -flto [Was: libstdc++ anomaly?]
Date: Sat, 24 Dec 2016 14:14:31 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.4.0
On 2016-12-19 15:10, Vadim Zeitlin wrote:
[...]
> BTW, another thing that I thought about while discussing this: the -flto
> option also came up and I wrote that it indeed allowed the compiler to
> compute the result at compile-time in this simple example, but that this
> wouldn't work in the real program. However now I'm not so sure: if you're
> using pow() just to build the cache of the powers of 10 for not too many
> exponents, wouldn't gcc be indeed smart enough to precompute all of them at
> compile-time? Of course, lmi doesn't use LTO currently, but perhaps it
> could be worth testing turning it on and checking how it affects the
> performance? We can clearly see that it allows for impressive optimizations
> in simple examples and while nothing guarantees that it would be also the
> case in real code, it might be worth trying it out.
It seemed simple enough to try. First of all, we have to specify '-flto'
in CFLAGS, CXXFLAGS, and LDFLAGS; all of them include $(gprof_flag), so
that's easy. And we apparently have to turn off debugging:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
| Link-time optimization does not work well with generation of debugging
| information. Combining -flto with -g is currently experimental and
| expected to produce unexpected results.
(lmi still follows the classic GNU suggestion always to build with debug
information, so we'd have to consider that before distributing '-flto'
binaries. The only benefit to me personally is that I can use gdb to get
a backtrace if lmi crashes; I might do that once a year or so.)
Anyway, here's what we now distribute, without '-flto':
/opt/lmi/src/lmi[0]$make clean
rm --force --recursive /opt/lmi/src/lmi/../build/lmi/Linux/gcc/ship
/opt/lmi/src/lmi[0]$time make $coefficiency install check_physical_closure >../log 2>&1
make $coefficiency install check_physical_closure > ../log 2>&1  1126.93s user 59.27s system 2497% cpu 47.497 total
/opt/lmi/src/lmi[0]$time make $coefficiency system_test >../log 2>&1
make $coefficiency system_test > ../log 2>&1  215.09s user 70.56s system 627% cpu 45.508 total
47 seconds to build; 46 seconds to run a regression test with 1300 cells.
With '-flto':
/opt/lmi/src/lmi[0]$make clean
rm --force --recursive /opt/lmi/src/lmi/../build/lmi/Linux/gcc/ship
/opt/lmi/src/lmi[0]$time make debug_flag= gprof_flag="-flto=8" $coefficiency install check_physical_closure >../log 2>&1
make debug_flag= gprof_flag="-flto=8" $coefficiency install > ../log 2>&1  1162.64s user 66.25s system 1962% cpu 1:02.63 total
63 seconds, but it fails, printing 63 (same number by coincidence) error
messages like this:
/tmp/ccb3DVbp.ltrans0.ltrans.o:ccb3DVbp.ltrans0.o:(.text+0x2851): undefined reference to `_imp___ZTV20wxObjectEventFunctor.lto_priv.608'
many of which seem to be duplicates.
We can measure performance despite that, because the regression test
uses 'lmi_cli_shared', not the wx binary. To run it, we need to make
this temporary change to ignore the wx linkage problems:
-system_test: $(data_dir)/configurable_settings.xml $(touchstone_md5sums) install
+system_test: $(data_dir)/configurable_settings.xml $(touchstone_md5sums)
/opt/lmi/src/lmi[2]$time make $coefficiency system_test >../log 2>&1
make $coefficiency system_test > ../log 2>&1  210.93s user 71.58s system 658% cpu 42.895 total
/opt/lmi/src/lmi[0]$cat ../log
System test:
All 1505 files match.
Improvement: (45.508 - 42.895) / 45.508 = six percent. Removing the
'install' prerequisite makes the comparison unfair because it has a
significant cost even when the build is up to date, viz.:
/opt/lmi/src/lmi[0]$time make $coefficiency install check_physical_closure >../log 2>&1
make $coefficiency install check_physical_closure > ../log 2>&1  2.36s user 0.12s system 95% cpu 2.596 total
Adding that back in: 2.596 + 42.895 = 45.491, so the improvement is
probably (45.508 - 45.491) / 45.508 = four hundredths of a percent.
But that's not the best way to test: we really should measure only
the time spent running lmi, not 'ihs_crc_comp'; and running 32 lmi
instances in parallel means we're really measuring only the one that
takes the longest--so, repeating with this temporary change:
--------8<--------8<--------8<--------
diff --git a/workhorse.make b/workhorse.make
index ac2e07a..f86456f 100644
--- a/workhorse.make
+++ b/workhorse.make
@@ -1297,6 +1297,8 @@ $(testdecks):
--pyx=system_testing \
--file=$@
@$(MD5SUM) --binary $(basename $(notdir $@)).* >> $(system_test_md5sums)
+
+xyzzy0:
@for z in $(dot_test_files); \
do \
$(PERFORM) $(bin_dir)/ihs_crc_comp$(EXEEXT) $$z $(touchstone_dir)/$$z \
@@ -1305,12 +1307,14 @@ $(testdecks):
done
.PHONY: system_test
-system_test: $(data_dir)/configurable_settings.xml $(touchstone_md5sums) install
+system_test: $(data_dir)/configurable_settings.xml $(touchstone_md5sums)
+system_test: $(data_dir)/configurable_settings.xml $(touchstone_md5sums)
@$(ECHO) System test:
@$(RM) --force $(addprefix $(test_dir)/*., $(test_result_suffixes))
@[ "$(strip $(testdecks))" != "" ] || ( $(ECHO) No testdecks. && false )
@testdecks=`$(LS) --sort=size $(testdecks) || $(ECHO) $(testdecks)` \
&& $(MAKE) --file=$(this_makefile) --directory=$(test_dir) $$testdecks
+
+xyzzy1:
@$(SORT) --output=$(system_test_analysis) $(system_test_analysis)
@$(SORT) --key=2 --output=$(system_test_md5sums) $(system_test_md5sums)
@$(CP) --preserve --update $(system_test_md5sums) $(system_test_md5sums2)
-------->8-------->8-------->8--------
and without any "$coefficiency" parallelism, with '-flto' we get:
/opt/lmi/src/lmi[0]$time make system_test
System test:
make system_test 119.40s user 17.83s system 93% cpu 2:27.30 total
while without '-flto' it's:
/opt/lmi/src/lmi[0]$time make system_test
System test:
make system_test 120.00s user 17.86s system 93% cpu 2:28.21 total
Improvement: (148.21 - 147.30) / 148.21 = six tenths of a percent, which
doesn't justify significantly slower builds and giving up '-ggdb'. Alas:
I really hoped to put those idle cores to good use when linking.