[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Gcl-devel] Omnibus gcl/acl2 performance post
From: |
Matt Kaufmann |
Subject: |
Re: [Gcl-devel] Omnibus gcl/acl2 performance post |
Date: |
Tue, 1 Oct 2013 21:22:24 -0500 |
Hi, Camm --
Congratulations! I did timings using ACL2 Version 6.3 (which was
released just this morning), and here are the results, which indeed
show some speed-up.
; GCL 2.6.8 CLtL1
30144.603u 1176.089s 1:19:27.64 656.9% 0+0k 484552+4273744io 381pf+0w
; GCL 2.6.10pre CLtL1
; (from today: git commit 7e321ad46dbb96449960e5c4cd6605b2af3a971c)
29950.747u 1029.940s 1:18:17.05 659.5% 0+0k 155464+6549560io 5pf+0w
I've included both of the above results in the timing results for
various Lisps, here:
http://www.cs.utexas.edu/users/moore/acl2/v6-3/new.html#performance
Mostly I don't have much insight from your extensive data to add
beyond what you are seeing. But I might have a useful comment or two.
For starters, consider these two lines from your message:
-188.79 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2
554.34 141 3726 3
-127.23 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2
1039.93 198 12635 4
I think this says that 2.6.10pre is 188.79 seconds faster than 2.6.8
for the first book (188.79 is (- 554.34 365.55)) and 44.1 seconds
faster for the second. I suspect that these proofs generate rather
large amounts of "output", which we're not seeing but which is being
stored by ACL2's "gag-mode" feature for potential future replay (more
on that below). So maybe there's an unusually large amount of
garbage, and your improvements for garbage collection and/or reduction
of cons words are helping. (The figure of -44.1 for "r2.6.10pre vs
r2.6.10pre.widecons" on elementary-bounders might back that up, unless
I'm reading it backwards!)
(Details about the "gag-mode" issue above: I plan to fix that problem
in the future, so that storing that output doesn't happen when proof
output is inhibited, as is the case during book certification.
Actually, I made such a fix just to the above elementary-bounders book
so that I could certify it on my Mac, using 2.6.8 with its limited
maxpages. But your tests probably don't use that fix. When I install
a fix in the ACL2 sources I'll send you a patch to play with, if you
ask me to do so.)
Here's another that I can probably explain:
-107.32 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 20
210 2 174.55 87 3042 2
I think that one makes VERY heavy use of fixnums, especially the form
(def::ung rectest (x) ...), without declarations or such to help.
Finally, thanks for the fix to save-gprof.lsp, which I haven't looked
at for many years. I'm pretty sure that instead of adding the
argument :unix, the added argument should be the name of the current
os, which in the ACL2 package is (os (w *the-live-state*)). If you
think I'm wrong, please let me know. Note that this expression does
indeed evaluate to :unix on a linux machine.
Thanks --
-- Matt
From: Camm Maguire <address@hidden>
Cc: address@hidden
Date: Tue, 01 Oct 2013 15:04:27 -0400
Greetings! And thank you so much for this very helpful report!
This post is long, so feel free to skip over any uninteresting detail.
The short news is, 2.6.10pre is now even or better with 2.6.8.
=============================================================================
In the analysis below, several regressions have been run collecting
the following statistics:
i=$(find books -name "*.cert.out");
for j in $i; do
awk '/seconds runtime/ {k=$(NF-2);}
/\[S?GC for/ {i++;a=gensub(".*T=([0-9]*).*","\\1","g")+0;j+=a}
/\[SGC on/ {l++}
END {print k,i,j,l,m}' m=$j $j;done | sort -n
giving the runtime seconds, the number of gc calls, the amount of gc
time, and the number of sgc on/off calls, which can be compared using
(e.g. files r2.6.10pre vs r2.6.8)
join <(awk '{print $5,$1,$2,$3,$4}' r2.6.10pre|sort) <(awk '{print
$5,$1,$2,$3,$4}' r2.6.8|sort) | awk '{print $2-$6,$0}' | sort -n
in turn giving a report like
-188.79 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2
554.34 141 3726 3
-127.23 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2
1039.93 198 12635 4
-84.02 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 20
210 2 151.25 113 3797 4
...
16.2 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 171.22
44 817 2
16.56 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 486.32
50 1378 2 469.76 103 2498 3
20.79 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 441.88
80 1868 3
Perhaps your knowledge of the various books and their algorithms might
shed light as to why certain factors produce these particular results
in the discussion below.
I'll be placing the results at http://people.debian.org/~camm/acl2/
with hopefully self-explanatory names.
=============================================================================
Performance issues:
=======================================
1) beginning relblock allocation
Issue: when gcl saves an image, it eliminates the hole and minimizes
the relblock space to make the disk image smaller. All gc statistics
used in balancing the heap and reducing overall gc time
(e.g. si::*optimize-maximum-pages*) are reset to 0. This seems to be
appropriate, as the type of calculation involved in building an image
is not necessarily representative of that done is using it. So on
startup, one has somewhat of a 'shrink wrapped' image. It takes some
time for gcl to gather new statistics and expand the heap from this
state. The idea is to expand each page type so that its size vis a
vis the rate of allocation is equivalent to the others. Thus, if for
some reason one starts with a larger allocation of a given type by
fiat, the gc will be triggered by the other types, and they will scale
according to the fiat type. As 2.6.8 did not shrink wrap the
relblock, it starts with a very large allocation, causing the heap to
grow more quickly and save gc time relative to 2.6.10.
Current status: At startup, 2.6.10 now scales the new hole size as
a set fraction of the number of (dynamically determined) available
pages, and the relblock size as a multiple of the non-relocatable
heap. Both of these operations cost nothing, but are somewhat
ad-hoc. The functions #'si::set-starting-hole-divisor and
#'si::set-starting-relblock-heap-multiple are provided to tune if
necessary the defaults of 10 and 2.
2) cons size
Issue: 2.6.9 and forward defaults to a two word cons, whereas 2.6.8
has a three word cons, the first word being a type word. This is to
save space and restore the build on more limited 32bit machines, which
has now been achieved. (acl2 6.2 is now in Debian testing). It should
also help with memory bandwidth. There is however an extra branch
required in typing an object, and more if immediate fixnums are in
force. In principle, this should easily be dominated by the cost of
referencing the pointer, but this should be tested.
Current status: we have an extra configure switch
--enable-widecons ; which defaults to "no".
Thankfully widecons is a net loss of some 7 min, with a profile like
(r2.6.10pre vs r2.6.10pre.widecons):
-44.1 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2
956.80 114 7651 2
-15.9 books/centaur/vl/transforms/xf-sizing.cert.out 134.27 36 790 2 150.17
37 880 2
-11.77 books/centaur/regression/common.cert.out 177.72 66 2074 2 189.49 51
1516 2
...
15.44 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 447.23
52 1537 2
17.25
books/workshops/2004/legato/support/generic-theory-tail-recursion-mult.cert.out
194.77 34 624 2 177.52 31 566 2
33.19 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 486.32
50 1378 2 453.13 56 1483 2
3) immediate fixnums
Issue: 2.6.9 and forward support immediate fixnums, the rationale
having been to lower memory requirements as described above. These
provide faster inlined arithmetic and comparisons, though require
extra branching when typing an object. When a two word cons and
immediate fixnums are both present, up to four branches can be
required in a typing, as immediate fixnums can now appear in the cdr
of the cons inspected to determine the type. We should see if this is
a win too.
Current status: We now have the following configure switches
--enable-immfix ; defaults to yes, can be disabled with --disable-immfix
--enable-fastimmfix=xx ; try to get at least an xx bit wide fixnum table
centered
; on the NULL address, which has no boxing cost
; default is 64, meaning use high memory
; immediate fixnums requiring arithmetic to box.
--enable-safecdr ; do not place immediate fixnums in cdr, but
; boxed versions instead, and speed up typing
; accordingly. Defaults to "no"
--enable-safecdrdbg ; debug the above algorithm and error on failure
The dominant fixnum cost is allocation, and this has already been
virtually eliminated in acl2 via its use of
#'allocate-bigger-fixnum-range. So for acl2, its really just the
typing cost vs. the arithmetic acceleration and saving of a pointer
dereference.
--disable-immfix is a slight net gain (30 sec), with a profile like
(r2.6.10pre vs r2.6.10pre.no-immfix):
-107.32 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23
20 210 2 174.55 87 3042 2
-10.62 books/unicode/utf8-decode.cert.out 116.08 26 325 2 126.70 24 316 2
-9.86 books/centaur/tutorial/intro.cert.out 159.80 68 1905 2 169.66 60 1793 2
...
12.08 books/workshops/1999/ste/inference.cert.out 149.30 20 287 2 137.22 24
294 2
13.54 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 173.88
27 463 2
16.07 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2
349.48 88 1950 2
I anticipate this is not likely to hold up when other immediate fixnum
accelerations are backported from master, e.g. accelerating eql.
Several of these experiments are still running. I'll post more if desired.
4) general gc:
Several inner gc loops have been optimized, significantly speeding up
contiguous gc in particular, and to a lesser extent, relocatable. All
contiguous gc used to also collect relocatable by default -- this is
now separated for efficiency. Traditionally, (si::gbc t) collected
everything, (si::gbc 1) relocatable, and (si::gbc nil) cells only.
This is still the case, with (si::gbc 0) added to collect contiguous
only.
echo '(time (progn (setq si::*optimize-maximum-pages* nil)(dotimes (i 1000)
(si::gbc ???))))' | ./saved_ansi_gcl |grep "^gbc time"
??? 2.6.8 2.6.10pre
=== ===== =========
t 9.500 7.510
1 7.250 6.980
0 ----- 6.699
nil 7.210 6.050
5) contblock allocation
some sgc bit logic has changed, particularly regarding contiguous
blocks, making the large contiguous allocation I recommended earlier
now possibly obsolete. I have not tested this, and leaving it in
should cause minimal harm.
When I get around to it, I really need to implement a trampoline at
the end of .data for all pc32 relocations, rendering the whole issue
mute. This will likely complicate many platforms, however.
Finally, I've used your save-gprof.lsp file, and have found one bug:
diff -u save-gprof.lsp.ori save-gprof.lsp
--- save-gprof.lsp.ori 2013-10-01 13:56:55.739771000 -0500
+++ save-gprof.lsp 2013-10-01 13:57:15.712921000 -0500
@@ -302,7 +302,7 @@
`(setq ,sym-initial-cbd
(,sym-pathname-os-to-unix
- (namestring (truename ""))
+ (namestring (truename "")) :unix
,sym-state))
`(,sym-f-put-global
',sym-cbd
Take care,
--
Camm Maguire address@hidden
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah