[Qemu-devel] [PATCH v3 0/6] trace: [tcg] Optimize per-vCPU tracing state
From: Lluís Vilanova
Subject: [Qemu-devel] [PATCH v3 0/6] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches
Date: Thu, 22 Dec 2016 19:35:37 +0100
User-agent: StGit/0.17.1-dirty
Optimizes tracing of events with the 'tcg' and 'vcpu' properties (e.g., memory
accesses), making it feasible to statically enable them by default on all QEMU
builds.
Some quick'n'dirty numbers with 400.perlbench (SPECcpu2006) on the train input
(medium size - suns.pl) and the guest_mem_before event:
* vanilla, statically disabled
real 0m5,827s
user 0m5,800s
sys 0m0,024s
* vanilla, statically enabled (overhead: 2.35x)
real 0m13,696s
user 0m13,684s
sys 0m0,008s
* multi-tb, statically disabled (overhead: 1.09x)
real 0m6,383s
user 0m6,352s
sys 0m0,028s
* multi-tb, statically enabled (overhead: 1.11x)
real 0m6,493s
user 0m6,468s
sys 0m0,020s
Right now, events with the 'tcg' property always generate TCG code to trace that
event at guest code execution time, where the event's dynamic state is checked.
This series adds a performance optimization where TCG code for events with the
'tcg' and 'vcpu' properties is not generated if the event is dynamically
disabled. This optimization raises two issues:
* An event can be dynamically disabled/enabled after the corresponding TCG code
has been generated (i.e., a new TB with the corresponding code should be
used).
* Each vCPU can have a different dynamic state for the same event (e.g., tracing
the memory accesses of only one process pinned to a vCPU).
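As a rough illustration of the optimization described above (not code from this series), the sketch below contrasts always emitting a trace op with emitting it only when the event is enabled at translation time. All names (ToyTB, toy_translate_always, toy_translate_elide, the TOY_OP_* values) are hypothetical stand-ins, not QEMU's actual types or API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical op codes for a toy translation block; not QEMU's TCG ops. */
enum { TOY_OP_GUEST_LOAD, TOY_OP_TRACE_MEM };

#define TOY_MAX_OPS 8

typedef struct {
    int ops[TOY_MAX_OPS];
    int n_ops;
} ToyTB;

static void toy_emit(ToyTB *tb, int op)
{
    assert(tb->n_ops < TOY_MAX_OPS);
    tb->ops[tb->n_ops++] = op;
}

/* Current behaviour as described: the trace op is always generated, and the
 * event's dynamic state is only checked when the guest code executes. */
static void toy_translate_always(ToyTB *tb)
{
    tb->n_ops = 0;
    toy_emit(tb, TOY_OP_TRACE_MEM);
    toy_emit(tb, TOY_OP_GUEST_LOAD);
}

/* Optimized behaviour as described: the trace op is not generated at all
 * when the event is dynamically disabled at translation time. */
static void toy_translate_elide(ToyTB *tb, bool event_enabled)
{
    tb->n_ops = 0;
    if (event_enabled) {
        toy_emit(tb, TOY_OP_TRACE_MEM);
    }
    toy_emit(tb, TOY_OP_GUEST_LOAD);
}
```

A block produced by toy_translate_elide() is only valid for the event state it was translated under, which is exactly why the two issues listed above arise.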
To handle both issues, this series replicates the shared physical TB cache,
creating a separate physical TB cache for every combination of event states
(those with the 'vcpu' and 'tcg' properties). Then, all vCPUs tracing the same
events will use the same physical TB cache.
Sharing physical TBs makes this very space efficient (only the physical TB
caches, simple arrays of pointers, are replicated). Sharing physical TB caches
maximizes TB reuse across vCPUs whenever possible, and makes dynamic event state
changes more efficient (simply use a different physical TB cache).
The physical TB cache array is indexed with the vCPU's trace event state
bitmask. This is simpler and more efficient than emitting TCG code to check
whether an event needs tracing, since we would then still have to either move
the tracing call code to a cold path (making tracing performance worse) or
leave it inlined (making non-tracing performance worse).
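A minimal sketch of the indexing idea, assuming two traceable events and a fixed-size cache; it is not the code from this series. ToyCPU, ToyTBCache, toy_cache_for() and toy_set_event() are hypothetical names chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: two events carry both the 'tcg' and 'vcpu' properties. */
#define TOY_NUM_EVENTS 2
#define TOY_NUM_CACHES (1u << TOY_NUM_EVENTS)  /* one cache per state combination */
#define TOY_CACHE_SIZE 16

typedef struct ToyTB {
    unsigned long pc;
} ToyTB;

/* A cache is just an array of pointers to translation blocks. */
typedef struct {
    ToyTB *slots[TOY_CACHE_SIZE];
} ToyTBCache;

static ToyTBCache toy_caches[TOY_NUM_CACHES];

typedef struct {
    unsigned trace_mask;  /* bit i set: event i is enabled on this vCPU */
} ToyCPU;

/* Select the cache for a vCPU by using its event state bitmask as the index. */
static ToyTBCache *toy_cache_for(const ToyCPU *cpu)
{
    assert(cpu->trace_mask < TOY_NUM_CACHES);
    return &toy_caches[cpu->trace_mask];
}

/* A dynamic state change only updates the bitmask; the next lookup then
 * lands in a different cache without any per-TB work. */
static void toy_set_event(ToyCPU *cpu, unsigned event, int enabled)
{
    assert(event < TOY_NUM_EVENTS);
    if (enabled) {
        cpu->trace_mask |= 1u << event;
    } else {
        cpu->trace_mask &= ~(1u << event);
    }
}
```

In this sketch, two vCPUs with the same bitmask resolve to the same cache, and no check of the event state is needed inside the translated code itself.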
This solution is also more efficient than eliding TCG code only when *zero*
vCPUs are tracing an event, since in that scheme enabling the event on a single
vCPU would impact the performance of all other vCPUs that are not tracing it.
Note on overheads: I suspect the culprit of the 1.15x overhead lies in the
double dereference
Signed-off-by: Lluís Vilanova <address@hidden>
---
Changes in v3
=============
* Rebase on 0737f32daf.
* Do not use reserved symbol prefixes ("__") [Stefan Hajnoczi].
* Refactor trace_get_vcpu_event_count() to be inlinable.
* Optimize cpu_tb_cache_set_requested() (hottest path).
Changes in v2
=============
* Fix bitmap copy in cpu_tb_cache_set_apply().
* Split generated code re-alignment into a separate patch [Daniel P. Berrange].
Lluís Vilanova (6):
exec: [tcg] Refactor flush of per-CPU virtual TB cache
trace: Make trace_get_vcpu_event_count() inlinable
exec: [tcg] Use multiple physical TB caches
exec: [tcg] Switch physical TB cache based on vCPU tracing state
trace: [tcg] Do not generate TCG code to trace dynamically-disabled events
trace: [tcg,trivial] Re-align generated code
cpu-exec.c | 11 +++-
cputlb.c | 2 -
include/exec/exec-all.h | 12 ++++
include/exec/tb-context.h | 2 -
include/qom/cpu.h | 3 +
qom/cpu.c | 3 +
scripts/tracetool/__init__.py | 1
scripts/tracetool/backend/dtrace.py | 2 -
scripts/tracetool/backend/ftrace.py | 20 +++----
scripts/tracetool/backend/log.py | 16 +++--
scripts/tracetool/backend/simple.py | 2 -
scripts/tracetool/backend/syslog.py | 6 +-
scripts/tracetool/backend/ust.py | 2 -
scripts/tracetool/format/h.py | 24 ++++++--
scripts/tracetool/format/tcg_h.py | 19 +++++-
scripts/tracetool/format/tcg_helper_c.py | 3 +
trace/control-internal.h | 5 ++
trace/control-target.c | 1
trace/control.c | 9 +--
trace/control.h | 5 +-
translate-all.c | 92 +++++++++++++++++++++++++-----
translate-all.h | 43 ++++++++++++++
translate-all.inc.h | 13 ++++
23 files changed, 237 insertions(+), 59 deletions(-)
create mode 100644 translate-all.inc.h
To: address@hidden
Cc: Stefan Hajnoczi <address@hidden>
Cc: Eduardo Habkost <address@hidden>
Cc: Eric Blake <address@hidden>