Re: [Qemu-devel] [patches] Re: [PULL] RISC-V QEMU Port Submission
From: Emilio G. Cota
Subject: Re: [Qemu-devel] [patches] Re: [PULL] RISC-V QEMU Port Submission
Date: Mon, 5 Mar 2018 14:00:14 -0500
User-agent: Mutt/1.5.24 (2015-08-30)
On Sat, Mar 03, 2018 at 02:26:12 +1300, Michael Clark wrote:
> It was qemu-2.7.50 (late 2016). The benchmarks were generated mid last year.
>
> I can run the benchmarks again... Has it doubled in speed?
It depends on the benchmarks. Small-ish benchmarks such as rv8-bench
show about a 1.5x speedup since QEMU v2.6.0 for Aarch64:
Aarch64 rv8-bench performance under QEMU user-mode
Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
[ASCII chart garbled by the list archive. Bars compare QEMU v2.8.0,
v2.9.0, v2.10.0 and v2.11.0 for each rv8-bench program (aes, bigint,
dhrystone, miniz, norx, primes, qsort, sha512) and their geomean; see
the png below.]
png: https://imgur.com/Agr5CJd
SPEC06int shows a larger improvement, up to ~2x avg speedup for the train
set:
Aarch64 SPEC06int (train set) performance under QEMU user-mode
Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
[ASCII chart garbled by the list archive. Bars compare per-benchmark
speedups across QEMU versions for the SPEC06int programs and their
geomean; see the png below.]
png: https://imgur.com/JknVT5H
Note that the test set is less sensitive to the changes:
https://imgur.com/W7CT0eO
Running small benchmarks (such as SPEC "test" or rv8-bench) is
very useful to get quick feedback on optimizations. However, some
of these runs are still dominated by parts of the code that aren't
that relevant -- for instance, some of them take so little time to
run that the major contributor to execution time is memory allocation.
Therefore, when publishing results it's best to stick with larger
benchmarks that run for longer (e.g. SPEC "train" set), which are more
sensitive to DBT performance.
I tried running some other benchmarks, such as nbench[1], under rv-jit.
I quickly get a "bus error" though -- I don't know whether I'm doing
something wrong, or whether binaries built with the glibc
cross-compiler I used to build riscv linux aren't supported.
I did manage to run rv8-bench on both rv-jit and qemu (v8 patchset);
rv-jit is 1.30x faster on average, although note that I dropped qsort
because it wasn't working properly on rv-jit:
rv8-bench performance under rv-jit and QEMU user-mode
Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
[qsort does not finish cleanly for rv8, so I dropped it.]
3 +-+-----+-------+------+-------+-------+-------+------+-------+-----+-+
2.5 +-+..................*****..........................................+-+
| *-+-* b1bae23b7c2 |
2 +-+..................*...*...................+-+-+..................+-+
1.5 +-+...........*****..*...*...................*****..................+-+
| ***** *-+-* * * ***** * * ++-+ ***** |
1 +-+...*-+-*...*...*..*...*...*...*...*****...*...*...****...*...*...+-+
0.5 +-+---*****---*****--*****---*****---*****---*****---****---*****---+-+
   aes   bigint  dhrystone  miniz   norx   primes   sha512   geomean
png: https://imgur.com/rLmTH3L
> I think I can get close to double again with tiered optimization and a good
> register allocator (lift RISC-V asm to SSA form). It's also a hotspot
> interpreter, which is definitely faster than compiling all code, as I
> benchmarked it. It profiles and only translates hot paths, so code that
> only runs a few iterations is not translated. When I did eager translation
> I got a slow-down.
Yes, hotspot is great for real-life workloads (e.g. booting a system). Note
though that most benchmarks (e.g. SPEC) don't translate code that often;
most execution time is spent in loops and therefore the quality of
the generated code does matter. Hotspot detection of TBs/traces is great
for this as well, because it allows you to spend more resources generating
higher-quality code -- for instance, see HQEMU[2].
Thanks,
Emilio
[1] https://github.com/cota/nbench
[2] http://www.iis.sinica.edu.tw/papers/dyhong/18243-F.pdf
PS. One page with all the png's: https://imgur.com/a/5P5zj