[lmi] gcc -flto [Was: libstdc++ anomaly?]


From: Greg Chicares
Subject: [lmi] gcc -flto [Was: libstdc++ anomaly?]
Date: Sat, 24 Dec 2016 14:14:31 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.4.0

On 2016-12-19 15:10, Vadim Zeitlin wrote:
[...]
>  BTW, another thing that I thought about while discussing this: the -flto
> option also came up and I wrote that it indeed allowed the compiler to
> compute the result at compile-time in this simple example, but that this
> wouldn't work in the real program. However now I'm not so sure: if you're
> using pow() just to build the cache of the powers of 10 for not too many
> exponents, wouldn't gcc be indeed smart enough to precompute all of them at
> compile-time? Of course, lmi doesn't use LTO currently, but perhaps it
> could be worth testing turning it on and checking how it affects the
> performance? We can clearly see that it allows for impressive optimizations
> in simple examples and while nothing guarantees that it would be also the
> case in real code, it might be worth trying it out.
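
To make that concrete, here is roughly the kind of cache in question -- a
hypothetical sketch, with invented names and an invented exponent bound,
not lmi's actual code:

  // Hypothetical sketch (not lmi's actual code): a cache of powers of ten
  // filled by calling std::pow() once per exponent.
  #include <cmath>   // std::pow()
  #include <cstddef> // std::size_t
  #include <vector>

  namespace
  {
      std::size_t const n_powers = 19; // Invented bound, for illustration.

      std::vector<double> make_powers_of_ten()
      {
          std::vector<double> v(n_powers);
          for(std::size_t k = 0; k < n_powers; ++k)
              v[k] = std::pow(10.0, static_cast<double>(k));
          return v;
      }

      // Built eagerly, once, at namespace scope.
      std::vector<double> const powers_of_ten = make_powers_of_ten();
  } // Unnamed namespace.

  // Callers multiply by a table entry instead of calling std::pow() each time.
  double scale_by_power_of_ten(double x, std::size_t k)
  {
      return x * powers_of_ten.at(k);
  }

The hope is that '-flto', by letting gcc see the table's initialization and
its callers together at link time, might evaluate those std::pow() calls at
build time; whether that happens in lmi, and whether it matters, is what the
timings below try to answer.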

It seemed simple enough to try. First of all, we have to specify '-flto'
in CFLAGS, CXXFLAGS, and LDFLAGS; all of them include $(gprof_flag), so
that's easy. And we apparently have to turn off debugging:

  https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
| Link-time optimization does not work well with generation of debugging
| information. Combining -flto with -g is currently experimental and
| expected to produce unexpected results.

(lmi still follows the classic GNU suggestion always to build with debug
information, so we'd have to consider that before distributing '-flto'
binaries. The only benefit to me personally is that I can use gdb to get
a backtrace if lmi crashes; I might do that once a year or so.)

Anyway, here's what we now distribute, without '-flto':

/opt/lmi/src/lmi[0]$make clean
rm --force --recursive /opt/lmi/src/lmi/../build/lmi/Linux/gcc/ship
/opt/lmi/src/lmi[0]$time make $coefficiency install check_physical_closure >../log 2>&1
make $coefficiency install check_physical_closure > ../log 2>&1  1126.93s user 59.27s system 2497% cpu 47.497 total
/opt/lmi/src/lmi[0]$time make $coefficiency system_test >../log 2>&1
make $coefficiency system_test > ../log 2>&1  215.09s user 70.56s system 627% cpu 45.508 total

47 seconds to build; 46 seconds to run a regression test with 1300 cells.

With '-flto':

/opt/lmi/src/lmi[0]$make clean
rm --force --recursive /opt/lmi/src/lmi/../build/lmi/Linux/gcc/ship
/opt/lmi/src/lmi[0]$time make debug_flag= gprof_flag="-flto=8" $coefficiency install check_physical_closure >../log 2>&1
make debug_flag= gprof_flag="-flto=8" $coefficiency install  > ../log 2>&1  1162.64s user 66.25s system 1962% cpu 1:02.63 total

63 seconds, but linking fails, printing 63 (the same number, by coincidence)
error messages like this one:

/tmp/ccb3DVbp.ltrans0.ltrans.o:ccb3DVbp.ltrans0.o:(.text+0x2851): undefined reference to `_imp___ZTV20wxObjectEventFunctor.lto_priv.608'

many of which seem to be duplicates.

We can measure performance despite that, because the regression test
uses 'lmi_cli_shared', not the wx binary. To run it, we need to make
this temporary change to ignore the wx linkage problems:

-system_test: $(data_dir)/configurable_settings.xml $(touchstone_md5sums) install
+system_test: $(data_dir)/configurable_settings.xml $(touchstone_md5sums)

/opt/lmi/src/lmi[2]$time make $coefficiency system_test >../log 2>&1
make $coefficiency system_test > ../log 2>&1  210.93s user 71.58s system 658% cpu 42.895 total
/opt/lmi/src/lmi[0]$cat ../log
System test:
All 1505 files match.

Improvement: (45.508 - 42.895) / 45.508 = six percent. Removing the
'install' prerequisite makes the comparison unfair because it has a
significant cost even when the build is up to date, viz.:

/opt/lmi/src/lmi[0]$time make $coefficiency install check_physical_closure >../log 2>&1
make $coefficiency install check_physical_closure > ../log 2>&1  2.36s user 0.12s system 95% cpu 2.596 total

Adding that back in: 2.596 + 42.895 = 45.491, so the improvement is
probably (45.508 - 45.491) / 45.508 = four hundredths of a percent.
But that's not the best way to test: we really should measure only
the time spent running lmi, not 'ihs_crc_comp'; and running 32 lmi
instances in parallel means we're really measuring only the one that
takes the longest--so, repeating with this temporary change:

--------8<--------8<--------8<--------
diff --git a/workhorse.make b/workhorse.make
index ac2e07a..f86456f 100644
--- a/workhorse.make
+++ b/workhorse.make
@@ -1297,6 +1297,8 @@ $(testdecks):
          --pyx=system_testing \
          --file=$@
        @$(MD5SUM) --binary $(basename $(notdir $@)).* >> $(system_test_md5sums)
+
+xyzzy0:
        @for z in $(dot_test_files); \
          do \
            $(PERFORM) $(bin_dir)/ihs_crc_comp$(EXEEXT) $$z $(touchstone_dir)/$$z \
@@ -1305,12 +1307,14 @@ $(testdecks):
          done
 
 .PHONY: system_test
-system_test: $(data_dir)/configurable_settings.xml $(touchstone_md5sums) install
+system_test: $(data_dir)/configurable_settings.xml $(touchstone_md5sums)
        @$(ECHO) System test:
        @$(RM) --force $(addprefix $(test_dir)/*., $(test_result_suffixes))
        @[ "$(strip $(testdecks))" != "" ] || ( $(ECHO) No testdecks. && false )
        @testdecks=`$(LS) --sort=size $(testdecks) || $(ECHO) $(testdecks)` \
          && $(MAKE) --file=$(this_makefile) --directory=$(test_dir) $$testdecks
+
+xyzzy1:
        @$(SORT) --output=$(system_test_analysis) $(system_test_analysis)
        @$(SORT) --key=2  --output=$(system_test_md5sums) $(system_test_md5sums)
        @$(CP) --preserve --update $(system_test_md5sums) $(system_test_md5sums2)

-------->8-------->8-------->8--------

and without any "$coefficiency" parallelism, with '-flto' we get:

/opt/lmi/src/lmi[0]$time make system_test 
System test:
make system_test  119.40s user 17.83s system 93% cpu 2:27.30 total

while without '-flto' it's:

/opt/lmi/src/lmi[0]$time make system_test
System test:
make system_test  120.00s user 17.86s system 93% cpu 2:28.21 total

Improvement: (148.21 - 147.30) / 148.21 = six tenths of a percent, which
doesn't justify significantly slower builds and giving up '-ggdb'. Alas:
I really hoped to put those idle cores to good use when linking.



