
immediate strings #2

From: Dmitry Antipov
Subject: immediate strings #2
Date: Mon, 28 Nov 2011 13:11:51 +0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20111115 Thunderbird/8.0

Here is the next version of the immediate strings patch, with further improvements
suggested by Paul. As mentioned before, strings of up to 21 bytes on 64-bit and up
to 9 bytes on 32-bit platforms can be immediate (the trailing '\0' is not counted).
Note that this code assumes sizeof (EMACS_INT) is equal to sizeof (void *), so it
is not compatible with WIDE_EMACS_INT.

Since there were reasonable doubts about whether this is practically useful,
I ran two benchmarks. The first is a simple string allocation benchmark,
attached as stringbench.el. The second is a compilation of everything in the
lisp subdirectory with byte-force-recompile. Everything was tested with
64-bit executables and the '-Q -batch' command-line options.

Configuration: ./configure --prefix=/not/exists --without-sound --without-pop \
               --with-x-toolkit=lucid --without-dbus --without-libotf \
               --without-selinux --without-xft --without-gsettings \
               --without-gnutls --without-rsvg --without-xml2
Compiler: gcc 4.6.1, optimization flags -O3

Old executable size is 12855360 bytes, new executable size is 12904512 bytes
(0.38% larger code size).

* Benchmark 1, 8 runs for each executable:

-- Old --

33.24user 0.23system 0:33.72elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+112338minor)pagefaults 0swaps
32.29user 0.25system 0:32.77elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+124684minor)pagefaults 0swaps
33.31user 0.24system 0:33.80elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+120164minor)pagefaults 0swaps
33.91user 0.24system 0:34.41elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+125401minor)pagefaults 0swaps
33.17user 0.27system 0:33.69elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+120374minor)pagefaults 0swaps
33.26user 0.31system 0:33.83elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+148027minor)pagefaults 0swaps
33.38user 0.28system 0:33.90elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+133420minor)pagefaults 0swaps
33.13user 0.23system 0:33.61elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+120341minor)pagefaults 0swaps

-- New --

32.59user 0.35system 0:33.18elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+149273minor)pagefaults 0swaps
32.62user 0.31system 0:33.17elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+149274minor)pagefaults 0swaps
32.44user 0.30system 0:32.98elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+145349minor)pagefaults 0swaps
29.29user 0.30system 0:29.80elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+136105minor)pagefaults 0swaps
31.90user 0.33system 0:32.47elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+161330minor)pagefaults 0swaps
34.29user 0.34system 0:34.88elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+160050minor)pagefaults 0swaps
32.64user 0.31system 0:33.20elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+150284minor)pagefaults 0swaps
33.17user 0.27system 0:33.69elapsed 99%CPU (0avgtext+0avgdata 
0inputs+0outputs (0major+126406minor)pagefaults 0swaps

-- Results --

Got ~2.5% better speed, but ~3.1% larger heap usage. One would expect heap
usage to be smaller, so why isn't it? The old code increments consing_since_gc
by the number of bytes allocated for each new string's data, but the new code
does so only for non-immediate strings; so the old code triggers GC earlier
than the new one, giving a smaller peak heap usage.

* Benchmark 2, 8 runs for each executable:

-- Old --

91.86user 0.49system 2:27.21elapsed 62%CPU (0avgtext+0avgdata 74736maxresident)k
0inputs+77864outputs (0major+39292minor)pagefaults 0swaps
91.57user 0.54system 2:27.30elapsed 62%CPU (0avgtext+0avgdata 74648maxresident)k
0inputs+78536outputs (0major+38641minor)pagefaults 0swaps
89.58user 0.52system 2:21.93elapsed 63%CPU (0avgtext+0avgdata 74684maxresident)k
0inputs+78536outputs (0major+38903minor)pagefaults 0swaps
91.53user 0.53system 2:25.14elapsed 63%CPU (0avgtext+0avgdata 74612maxresident)k
0inputs+78536outputs (0major+38538minor)pagefaults 0swaps
91.49user 0.56system 2:24.56elapsed 63%CPU (0avgtext+0avgdata 74708maxresident)k
0inputs+78528outputs (0major+38716minor)pagefaults 0swaps
91.77user 0.53system 2:24.01elapsed 64%CPU (0avgtext+0avgdata 74660maxresident)k
0inputs+78536outputs (0major+39164minor)pagefaults 0swaps
91.44user 0.54system 2:27.12elapsed 62%CPU (0avgtext+0avgdata 74728maxresident)k
0inputs+78536outputs (0major+39173minor)pagefaults 0swaps
91.72user 0.50system 2:24.25elapsed 63%CPU (0avgtext+0avgdata 74680maxresident)k
0inputs+78528outputs (0major+39538minor)pagefaults 0swaps

-- New --

89.98user 0.53system 2:22.79elapsed 63%CPU (0avgtext+0avgdata 73440maxresident)k
0inputs+78536outputs (0major+36362minor)pagefaults 0swaps
89.91user 0.51system 2:24.10elapsed 62%CPU (0avgtext+0avgdata 73528maxresident)k
0inputs+78528outputs (0major+36753minor)pagefaults 0swaps
89.85user 0.48system 2:24.74elapsed 62%CPU (0avgtext+0avgdata 73392maxresident)k
0inputs+78536outputs (0major+36745minor)pagefaults 0swaps
90.12user 0.54system 2:22.56elapsed 63%CPU (0avgtext+0avgdata 73440maxresident)k
0inputs+78536outputs (0major+37347minor)pagefaults 0swaps
89.95user 0.53system 2:23.74elapsed 62%CPU (0avgtext+0avgdata 73416maxresident)k
0inputs+78536outputs (0major+37292minor)pagefaults 0swaps
91.26user 0.53system 2:25.64elapsed 63%CPU (0avgtext+0avgdata 73440maxresident)k
0inputs+78536outputs (0major+36782minor)pagefaults 0swaps
90.03user 0.56system 2:25.01elapsed 62%CPU (0avgtext+0avgdata 73376maxresident)k
0inputs+78536outputs (0major+37418minor)pagefaults 0swaps
90.15user 0.54system 2:25.73elapsed 62%CPU (0avgtext+0avgdata 73448maxresident)k
0inputs+78536outputs (0major+37279minor)pagefaults 0swaps

-- Results --

Got ~1.3% better speed and ~1.7% smaller heap usage. Since this benchmark does
a lot of things besides string allocation, the 'later GC' effect is negligible
here.

Obviously, the new string code is more complex and, at first glance, should be
slower, because every access to string data involves evaluating a conditional
expression, which puts more pressure on the instruction cache and branch
prediction logic. But the overall improvement may be explained by better
spatial locality and thus better data cache utilization: a normal string and
its data may be allocated far away from each other, so when a cache line is
filled by accessing a member of Lisp_String, it's very unlikely that the same
line also holds the string data; for an immediate string the data lives inside
the Lisp_String itself, so such a miss should be quite rare. This could be
checked, for example, with valgrind's cachegrind tool (but I didn't try it).


Attachment: stringbench.el
Description: Text document

Attachment: immstr2.patch
Description: Text document
