[Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode
From: Emilio G. Cota
Subject: [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode
Date: Sun, 23 Aug 2015 20:23:29 -0400
Hi all,
Here is MTTCG code I've been working on out-of-tree for the last few months.
The patchset applies on top of pbonzini's mttcg branch, commit ca56de6f.
Fetch the branch from: https://github.com/bonzini/qemu/commits/mttcg
The highlights of the patchset are as follows:
- The first 5 patches are direct fixes for bugs that exist only in the mttcg
branch.
- Patches 6-12 fix issues in the master branch.
- The remaining patches are really the meat of this patchset.
The main features are:
* Support of MTTCG for both user and system mode.
* Design: each CPU first checks a per-CPU TB jump list protected by a seqlock;
if the TB is not found there, it checks the global, RCU-protected 'hash table'
(i.e. fixed number of buckets); if the TB is not there either, it grabs the
lock, checks again, and only then generates the code and adds the TB to the
hash table.
It makes sense that Paolo's recent work on the mttcg branch ended up
being almost identical to this--it's simple and it scales well.
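The lookup sequence above could be sketched roughly as follows. This is a
single-threaded simplification: all names (tb_find, tb_htable, sizes) are
illustrative rather than the actual QEMU API, and the seqlock around the
per-CPU cache and the RCU read-side section around the global table walk
are elided.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TB_CACHE_SIZE  256   /* per-CPU lookup cache (illustrative size) */
#define TB_HTABLE_SIZE 1024  /* global table: fixed number of buckets */

typedef struct TB {
    uint64_t pc;
    struct TB *next;         /* bucket chain in the global table */
} TB;

static TB *tb_htable[TB_HTABLE_SIZE];
static pthread_mutex_t tb_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct CPUState {
    /* fast path; in the real design reads here are seqlock-protected */
    TB *tb_cache[TB_CACHE_SIZE];
} CPUState;

static size_t tb_hash(uint64_t pc, size_t size) { return pc % size; }

static TB *tb_htable_lookup(uint64_t pc)
{
    TB *tb = tb_htable[tb_hash(pc, TB_HTABLE_SIZE)];
    while (tb && tb->pc != pc) {
        tb = tb->next;
    }
    return tb;
}

/* Slow path, called with tb_lock held: generate code and insert the TB. */
static TB *tb_gen_code(uint64_t pc)
{
    size_t h = tb_hash(pc, TB_HTABLE_SIZE);
    TB *tb = malloc(sizeof(*tb));
    tb->pc = pc;
    tb->next = tb_htable[h];
    tb_htable[h] = tb;
    return tb;
}

static TB *tb_find(CPUState *cpu, uint64_t pc)
{
    size_t slot = tb_hash(pc, TB_CACHE_SIZE);
    TB *tb = cpu->tb_cache[slot];

    if (tb && tb->pc == pc) {
        return tb;                    /* 1. per-CPU cache hit */
    }
    tb = tb_htable_lookup(pc);        /* 2. global hash table */
    if (!tb) {
        pthread_mutex_lock(&tb_lock); /* 3. lock, re-check, generate */
        tb = tb_htable_lookup(pc);
        if (!tb) {
            tb = tb_gen_code(pc);
        }
        pthread_mutex_unlock(&tb_lock);
    }
    cpu->tb_cache[slot] = tb;
    return tb;
}
```

The point of the three-step ladder is that the common case (step 1, and to a
lesser extent step 2) never takes the lock; only code generation does.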
* tb_lock must be held every time code is generated. The rationale is
that most of the time QEMU is executing code, not generating it.
* tb_flush: do it once all other CPUs have been put to sleep by calling
rcu_synchronize().
We also instrument tb_lock to make sure that only one tb_flush request can
happen at a given time. For this, a mechanism to schedule work is added to
supersede cpu_sched_safe_work, which cannot work in usermode. I've also
toyed with an alternative version that doesn't force the flushing CPU to
exit, but to make this work we have to save/restore the RCU read
lock while tb_lock is held in order to avoid deadlocks. This isn't too
pretty, but it's good to know that the option is there.
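As a rough illustration of the two invariants above--flush only once every
other CPU is asleep, and allow only one outstanding flush request--here is a
hypothetical single-threaded sketch. All names (cpu_running, request_tb_flush,
...) are invented for illustration; the series itself waits for quiescence via
rcu_synchronize() and the safe-work mechanism, not a polling check like this.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

static bool cpu_running[NR_CPUS];   /* all false == everyone asleep */
static int tb_count = 100;          /* stand-in for the translation buffer */
static bool flush_pending;

static bool all_cpus_asleep(void)
{
    for (int i = 0; i < NR_CPUS; i++) {
        if (cpu_running[i]) {
            return false;
        }
    }
    return true;
}

/* Only one flush request may be outstanding at a time. */
static bool request_tb_flush(void)
{
    if (flush_pending) {
        return false;               /* a flush is already in flight */
    }
    flush_pending = true;
    return true;
}

static void do_tb_flush(void)
{
    /* Flushing is only safe once every vCPU has stopped executing TBs. */
    assert(all_cpus_asleep());
    tb_count = 0;
    flush_pending = false;
}
```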
* I focused on x86 since it is a complex ISA and we support many cores via
-smp.
I work on a 64-core machine so concurrency bugs show up relatively easily.
Atomics are modeled using spinlocks, i.e. one host lock per guest cache
line.
Note that spinlocks are far better than mutexes for this--on 64 cores,
throughput on highly concurrent workloads (synchrobench, see below) is 2X
higher with spinlocks.
Advantages:
+ Scalability. No unrelated atomics (e.g. atomics on the same page)
can interfere with each other. Of course if the guest code
has false sharing (i.e. atomics on the same cache line), then
there's not much the host can do about that.
This is an improved version of what I sent in May:
https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg01641.html
Performance numbers are below.
+ No requirements on the capabilities of the host machine, e.g.
no need for a host cmpxchg instruction. That is, we'd have no problem
running x86 code on a weaker host (say ARM/PPC) although of course we'd
have to sprinkle quite a few memory barriers. Note that the current
MTTCG relies on cmpxchg(), which would be insufficient to run x86 code
on ARM/PPC since that cmpxchg could very well race with a regular store
(whereas in x86 it cannot).
+ Works unchanged for both system and user modes. As far as I can
tell the TLB-based approach that Alvise is working on couldn't
be used without the TLB--correct me if I'm wrong, it's been
quite some time since I looked at that work.
Disadvantages:
- Overhead is added to every guest store. Depending on how frequent
stores are, this can end up being significant single-threaded
overhead (I've measured from a few % to up to ~50%).
Note that this overhead applies to strong memory models such
as x86, since the ISA can deal with concurrent stores and atomic
instructions. Weaker memory models such as ARM/PPC's wouldn't have this
overhead.
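A minimal sketch of the per-cache-line locking described above, assuming
64-byte guest cache lines and a power-of-two lock array. All names and sizes
here are illustrative, not the patchset's actual code; the instrumented store
helper shows where the per-store overhead discussed above comes from.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BITS 6     /* 64-byte guest cache lines */
#define NR_ATOMIC_LOCKS 4096  /* must be a power of two */

/* One host spinlock per guest cache line, folded into a fixed array. */
static atomic_int atomic_locks[NR_ATOMIC_LOCKS];

static size_t addr_to_lock(uint64_t guest_addr)
{
    /* Drop the line-offset bits, then fold into the lock array. */
    return (guest_addr >> CACHE_LINE_BITS) & (NR_ATOMIC_LOCKS - 1);
}

static void guest_lock(uint64_t guest_addr)
{
    atomic_int *lock = &atomic_locks[addr_to_lock(guest_addr)];
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
        /* spin until the previous holder releases the line */
    }
}

static void guest_unlock(uint64_t guest_addr)
{
    atomic_int *lock = &atomic_locks[addr_to_lock(guest_addr)];
    atomic_store_explicit(lock, 0, memory_order_release);
}

/* Every guest store (not just atomics) takes the line's lock so that it
 * cannot race with a concurrent locked operation on the same line. */
static void guest_store8(uint8_t *host_ptr, uint64_t guest_addr, uint8_t val)
{
    guest_lock(guest_addr);
    *host_ptr = val;
    guest_unlock(guest_addr);
}
```

Because two accesses contend only when they hash to the same line's lock,
unrelated atomics stay independent, which is where the scalability advantage
above comes from.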
* Performance
I've used four C/C++ benchmarks from synchrobench:
https://github.com/gramoli/synchrobench
I'm running them with these arguments: -u 0 -f 1 -d 10000 -t $n_threads
Here are two comparisons:
* usermode vs. native http://imgur.com/RggzgyU
* qemu-system vs qemu-KVM http://imgur.com/H9iH06B
(full-system is run with -m 4096).
Throughput is normalised for each of the four configurations over their
throughput with 1 thread.
For the single-thread performance overhead of instrumenting writes I used
two apps from PARSEC, both with the 'large' input:
[Note that for the multithreaded tests I did not use PARSEC; it doesn't
scale at all on large systems.]
blackscholes, 1 thread, stores are ~8% of executed instructions:
pbonzini/mttcg+Patches1-5: 62.922099012 seconds ( +- 0.05% )
+entire patchset: 67.680987626 seconds ( +- 0.35% )
That's about an 8% perf overhead.
swaptions, 1 thread, stores are ~7% of executed instructions:
pbonzini/mttcg+Patches1-5: 144.542495834 seconds ( +- 0.49% )
+entire patchset: 157.673401200 seconds ( +- 0.25% )
That's about a 9% perf overhead.
All tests use taskset appropriately to pack threads into CPUs in the
same NUMA node, if possible.
All tests are run on a 64-core (4x16) AMD Opteron 6376 with turbo core
disabled.
* Known Issues
- In system mode, when run with a high number of threads, segfaults on
translated code happen every now and then.
Is there anything useful I can do with the segfaulting address? For
example:
(gdb) bt
#0 0x00007fbf8013d89f in ?? ()
#1 0x0000000000000000 in ?? ()
Also, are there any things that should be protected by tb_lock but
aren't? The only potential issue I've thought of so far is direct jumps
racing with tb_phys_invalidate, but I need to analyze this in more detail.
* Future work
- Run on a PowerPC host to see how bad the barrier sprinkling has to be.
I have access to a host, so I should get to this in the next few days.
However, ppc usermode doesn't work with multithreaded programs--help would
be appreciated, see this thread:
http://lists.gnu.org/archive/html/qemu-ppc/2015-06/msg00164.html
- Support more ISAs. I have done ARM, SPARC and PPC, but haven't
tested them much so I'm keeping them out of this patchset.
Thanks,
Emilio