Re: [Qemu-devel] [RFC v3 04/19] docs: new design document multi-thread-tcg.txt (DRAFTING)
Fri, 24 Jun 2016 00:33:49 +0300
On 03/06/16 23:40, Alex Bennée wrote:
> This is a current DRAFT of a design proposal for upgrading TCG emulation
> to take advantage of modern CPUs by running a thread-per-CPU. The
> document goes through the various areas of the code affected by such a
> change and proposes design requirements for each part of the solution.
> It has been written *without* explicit reference to the current ongoing
> efforts to introduce this feature. The hope being we can review and
> discuss the design choices without assuming the current choices taken by
> the implementation are correct.
> Signed-off-by: Alex Bennée <address@hidden>
> - initial version
> - update discussion on locks
> - bit more detail on vCPU scheduling
> - explicitly mention Translation Blocks
> - emulated hardware state already covered by iomutex
> - a few minor rewords
> - mention this covers system-mode
> - describe the main run-loop and lookup hot-path
> - mention multi-concurrent-reader lookups
> - enumerate reasons for invalidation
> - add more details on lookup structures
> - describe the softmmu hot-path better
> - mention store-after-load barrier problem
> docs/multi-thread-tcg.txt | 225
> 1 file changed, 225 insertions(+)
> create mode 100644 docs/multi-thread-tcg.txt
> diff --git a/docs/multi-thread-tcg.txt b/docs/multi-thread-tcg.txt
> new file mode 100644
> index 0000000..5c88c99
> --- /dev/null
> +++ b/docs/multi-thread-tcg.txt
> @@ -0,0 +1,225 @@
> +Copyright (c) 2015 Linaro Ltd.
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +the COPYING file in the top-level directory.
> +STATUS: DRAFTING
> +This document outlines the design for multi-threaded TCG system-mode
> +emulation. The current user-mode emulation mirrors the thread
> +structure of the translated executable.
> +The original system-mode TCG implementation was single threaded and
> +dealt with multiple CPUs with simple round-robin scheduling. This
> +simplified a lot of things but became increasingly limited as systems
> +being emulated gained additional cores and per-core performance gains
> +for host systems started to level off.
> +vCPU Scheduling
> +We introduce a new running mode where each vCPU will run on its own
> +user-space thread. This will be enabled by default for all
> +FE/BE combinations that have had the required work done to support
> +this safely.
> +In the general case of running translated code there should be no
> +inter-vCPU dependencies and all vCPUs should be able to run at full
> +speed. Synchronisation will only be required while accessing internal
> +shared data structures or when the emulated architecture requires a
> +coherent representation of the emulated machine state.
> +Shared Data Structures
> +Main Run Loop
> +Even when there is no code being generated there are a number of
> +structures associated with the hot-path through the main run-loop.
> +These are associated with looking up the next translation block to
> +execute. These include:
> + tb_jmp_cache (per-vCPU, cache of recent jumps)
> + tb_phys_hash (global, phys address->tb lookup)
> +As TB linking only occurs when blocks are in the same page, this code
> +is critical to performance: looking up the next TB to execute is the
> +most common reason to exit the generated code.
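The hot-path lookup described above might be sketched roughly as follows. This is an illustrative simplification with made-up sizes and names, not QEMU's exact definitions: a small direct-mapped per-vCPU cache keyed by guest PC that lets the common case avoid the global tb_phys_hash entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the per-vCPU jump cache (real QEMU
 * structures differ). */
#define TB_JMP_CACHE_BITS 12
#define TB_JMP_CACHE_SIZE (1u << TB_JMP_CACHE_BITS)

typedef struct TranslationBlock {
    uint64_t pc;   /* guest PC the block was translated from */
} TranslationBlock;

typedef struct vCPU {
    TranslationBlock *tb_jmp_cache[TB_JMP_CACHE_SIZE];
} vCPU;

static unsigned tb_jmp_cache_hash(uint64_t pc)
{
    return (unsigned)(pc >> 2) & (TB_JMP_CACHE_SIZE - 1);
}

/* Hot path: a hit here needs no lock; a miss falls back to the global
 * (and, under MTTCG, contended) physical-hash lookup, omitted here. */
TranslationBlock *tb_find_fast(vCPU *cpu, uint64_t pc)
{
    TranslationBlock *tb = cpu->tb_jmp_cache[tb_jmp_cache_hash(pc)];
    if (tb != NULL && tb->pc == pc) {
        return tb;
    }
    return NULL;
}
```

The per-vCPU cache is exactly why it matters whether other threads may write to it: if only the owning thread updates it, no lock is needed on the hit path.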
> +DESIGN REQUIREMENT: Make access to lookup structures safe with
> +multiple reader/writer threads. Minimise any lock contention to do it.
> +Global TCG State
> +We need to protect the entire code generation cycle including any post
> +generation patching of the translated code. This also implies a shared
> +translation buffer which contains code running on all cores. Any
> +execution path that comes to the main run loop will need to hold a
> +mutex for code generation. This also includes times when we need to flush
> +code or entries from any shared lookups/caches. Structures held on a
> +per-vCPU basis won't need locking unless other vCPUs will need to
> +modify them.
> +DESIGN REQUIREMENT: Add locking around all code generation and TB
> +patching. If possible make shared lookup/caches able to handle multiple
> +readers without locks otherwise protect them with locks as well.
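A minimal sketch of that requirement, with hypothetical names (QEMU's real lock and translation state are more involved): one mutex serialises every path that generates or patches code.

```c
#include <pthread.h>
#include <stdint.h>

/* Illustrative: a single lock covering all code generation and TB
 * patching, per the design requirement above. */
static pthread_mutex_t tb_lock = PTHREAD_MUTEX_INITIALIZER;
static int tb_count;   /* stands in for the shared translation buffer */

int tb_gen_code(uint64_t pc)
{
    int id;
    (void)pc;
    pthread_mutex_lock(&tb_lock);
    id = ++tb_count;   /* all shared-state updates happen under the lock */
    pthread_mutex_unlock(&tb_lock);
    return id;
}
```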
> +Translation Blocks
> +Currently the whole system shares a single code generation buffer
> +which when full will force a flush of all translations and start from
> +scratch again.
> +Once a basic block has been translated it will continue to be used
> +until it is invalidated. These invalidation events are typically due
> +to a change to the state of a physical page:
> + - code modification (self modify code, patching code)
> + - page changes (new mapping to physical page)
Mapping changes invalidate translation blocks in user-mode emulation
only. In system-mode emulation, we just do a TLB flush and clear the
CPU's 'tb_jmp_cache' but don't invalidate any translation block; just
follow tlb_flush(). That is why, in system-mode emulation, we can't do
direct jumps to another page or to a TB which spans a page boundary.
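That system-mode behaviour could be sketched like this (illustrative types and sizes only): the flush clears the TLB and the cached jump targets, but the TranslationBlocks themselves stay valid.

```c
#include <stdint.h>
#include <string.h>

#define TLB_SIZE          256
#define TB_JMP_CACHE_SIZE 4096

typedef struct TranslationBlock TranslationBlock;

typedef struct vCPU {
    uintptr_t tlb[TLB_SIZE];                           /* stand-in TLB */
    TranslationBlock *tb_jmp_cache[TB_JMP_CACHE_SIZE];
} vCPU;

/* System-mode mapping change: flush the TLB and forget cached jump
 * targets -- but do NOT invalidate any TranslationBlock. */
void tlb_flush(vCPU *cpu)
{
    memset(cpu->tlb, 0, sizeof(cpu->tlb));
    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
}
```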
> + - debugging operations (breakpoint insertion/removal)
> +There exist several places reference to TBs exist which need to be
> +cleared in a safe way.
You mean: "There are several places where references to TBs exist which
need to be cleared in a safe way."?
> +The main reference is a global page table (l1_map) which provides a 2
> +level look-up for PageDesc structures which contain pointers to the
> +start of a linked list of all Translation Blocks in that page (see
Actually, 'l1_map' is multi-level; see the comment above 'V_L2_BITS'.
> +When a block is invalidated any blocks which directly jump to it need
> +to have those jumps removed. This requires navigating the tb_jump_list
> +linked list as well as patching the jump code in a safe way.
> +Finally there are a number of look-up mechanisms for accelerating
> +lookup of the next TB. These cache and hashed tables need to have
> +references removed in a safe way.
> +DESIGN REQUIREMENT: Safely handle invalidation of TBs
> + - safely patch direct jumps
> + - remove central PageDesc lookup entries
> + - ensure lookup caches/hashes are safely updated
> +Memory maps and TLBs
> +The memory handling code is fairly critical to the speed of memory
> +access in the emulated system. The SoftMMU code is designed so the
> +hot-path can be handled entirely within translated code. This is
> +handled with a per-vCPU TLB structure which once populated will allow
> +a series of accesses to the page to occur without exiting the
> +translated code. It is possible to set flags in the TLB address which
> +will ensure the slow-path is taken for each access. This can be done
> +to support:
> + - Memory regions (dividing up access to PIO, MMIO and RAM)
> + - Dirty page tracking (for code gen, migration and display)
> + - Virtual TLB (for translating guest address->real address)
> +When the TLB tables are updated we need to ensure they are done in a
> +safe way by bringing all executing threads to a halt before making the
Actually, we just need to be thread-safe when modifying vCPU TLB entries
from another thread. If it is possible to do this using (relaxed) atomic
memory accesses, we could obviously benefit from it.
> +DESIGN REQUIREMENTS:
> + - TLB Flush All/Page
> + - can be across-CPUs
> + - will need all other CPUs brought to a halt
Only a *cross-CPU* TLB flush *may* need the other CPUs brought to a halt.
> + - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
> + - This is a per-CPU table - by definition can't race
> + - updated by its own thread when the slow-path is forced
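To illustrate the relaxed-atomic alternative suggested above (hypothetical field names, not the real CPUTLBEntry layout): a cross-CPU flush can atomically poison an entry instead of halting the owning vCPU, which then refills it via the slow path.

```c
#include <stdatomic.h>
#include <stdint.h>

#define TLB_INVALID ((uint64_t)-1)

typedef struct CPUTLBEntry {
    _Atomic uint64_t addr_read;  /* page address, or TLB_INVALID */
    uint64_t addend;             /* guest->host offset, owner-only */
} CPUTLBEntry;

/* Called from another thread: a relaxed store is enough to force the
 * owning vCPU onto the slow path on its next access. */
void tlb_entry_invalidate(CPUTLBEntry *e)
{
    atomic_store_explicit(&e->addr_read, TLB_INVALID,
                          memory_order_relaxed);
}

int tlb_entry_hit(CPUTLBEntry *e, uint64_t addr_page)
{
    return atomic_load_explicit(&e->addr_read, memory_order_relaxed)
           == addr_page;
}
```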
> +Emulated hardware state
> +Currently thanks to KVM work any access to IO memory is automatically
> +protected by the global iothread mutex. Any IO region that doesn't use
> +the global mutex is expected to do its own locking.
Is it worth mentioning that we're going to push iothread locking down
into TCG as much as possible?
> +Memory Consistency
> +Between emulated guests and host systems there are a range of memory
> +consistency models. Even emulating weakly ordered systems on strongly
> +ordered hosts needs to ensure things like store-after-load re-ordering
> +can be prevented when the guest wants to.
> +Memory Barriers
> +Barriers (sometimes known as fences) provide a mechanism for software
> +to enforce a particular ordering of memory operations from the point
> +of view of external observers (e.g. another processor core). They can
> +apply to any memory operations as well as just loads or stores.
> +The Linux kernel has an excellent write-up on the various forms of
> +memory barrier and the guarantees they can provide.
> +Barriers are often wrapped around synchronisation primitives to
> +provide explicit memory ordering semantics. However they can be used
> +by themselves to provide safe lockless access by ensuring for example
> +a signal flag will always be set after a payload.
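The "flag after payload" pattern mentioned above can be shown with C11 atomics (QEMU wraps equivalents as smp_wmb()/smp_rmb() etc. in qemu/atomic.h; the function names here are illustrative): the release store makes the payload visible before the flag, and the acquire load pairs with it on the reader side.

```c
#include <stdatomic.h>

static int payload;        /* plain data, published via the flag */
static atomic_int ready;   /* the signal flag */

void producer(int value)
{
    payload = value;                     /* 1. write payload          */
    atomic_store_explicit(&ready, 1,     /* 2. set flag with release: */
                          memory_order_release); /* payload first     */
}

int consumer(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = payload;  /* guaranteed to see the producer's payload */
        return 1;
    }
    return 0;
}
```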
> +DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
> +This would enforce a strong load/store ordering so all loads/stores
> +complete at the memory barrier. On single-core non-SMP strongly
> +ordered backends this could become a NOP.
> +There may be a case for further refinement if this causes performance
I think it's worth mentioning that, aside from explicit standalone memory
barrier instructions, there are also implicit memory ordering semantics
which come with each guest memory access instruction, e.g. relaxed,
acquire/release, sequentially consistent.
> +Memory Control and Maintenance
> +This includes a class of instructions for controlling system cache
> +behaviour. While QEMU doesn't model cache behaviour these instructions
> +are often seen when code modification has taken place to ensure the
> +changes take effect.
> +Synchronisation Primitives
> +There are two broad types of synchronisation primitives found in
> +modern ISAs: atomic instructions and exclusive regions.
> +The first type offers a simple atomic instruction which will guarantee
> +some sort of test and conditional store will be truly atomic w.r.t.
> +other cores sharing access to the memory. The classic example is the
> +x86 cmpxchg instruction.
> +The second type offers a pair of load/store instructions which
> +guarantee that a region of memory has not been touched between the
> +load and store instructions. An example of this is ARM's ldrex/strex
> +pair where the strex instruction will return a flag indicating a
> +successful store only if no other CPU has accessed the memory region
> +since the ldrex.
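One way to connect the two: an ldrex/strex pair can be emulated on top of a host compare-and-swap (the cmpxchg-style primitive above). This sketch uses invented names, and it is weaker than real exclusives: a value changed and changed back (ABA) would not be detected.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t *addr;  /* location marked exclusive */
    uint32_t val;            /* value observed by the exclusive load */
} ExclusiveState;

uint32_t emu_ldrex(ExclusiveState *ex, _Atomic uint32_t *addr)
{
    ex->addr = addr;
    ex->val = atomic_load(addr);
    return ex->val;
}

/* Returns 0 on success (mirroring strex's status flag), 1 on failure:
 * the store goes through only if the location still holds the value
 * seen by the exclusive load. */
int emu_strex(ExclusiveState *ex, _Atomic uint32_t *addr, uint32_t newval)
{
    uint32_t expected = ex->val;
    if (ex->addr != addr) {
        return 1;  /* no matching exclusive load on this address */
    }
    return atomic_compare_exchange_strong(addr, &expected, newval) ? 0 : 1;
}
```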
> +Traditionally TCG has generated a series of operations that work
> +because they are within the context of a single translation block so
> +will have completed before another CPU is scheduled. However with
> +the ability to have multiple threads running to emulate multiple CPUs
> +we will need to explicitly expose these semantics.
> +DESIGN REQUIREMENTS:
> + - atomics
What "atomics" means here ...
> + - Introduce some atomic TCG ops for the common semantics
> + - The default fallback helper function will use qemu_atomics
> + - Each backend can then add a more efficient implementation
> + - load/store exclusive
... and how do they relate to "load/store exclusive"?
> + [AJB:
> + There are currently a number of proposals of interest:
> + - Greensocs tweaks to ldst ex (using locks)
> + - Slow-path for atomic instruction translation 
> + - Helper-based Atomic Instruction Emulation (AIE) 
> + ]
We also have a problem emulating a 64-bit guest on a 32-bit host: while a
naturally aligned machine-word memory access is usually atomic, a single
64-bit guest memory access instruction is translated into a series of
two 32-bit host memory access instructions. That can (will) break guest
code assumptions about memory access atomicity.
> + http://thread.gmane.org/gmane.comp.emulators.qemu/334561
> + http://thread.gmane.org/gmane.comp.emulators.qemu/335297
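One possible fallback for the 64-bit-on-32-bit tearing problem (a sketch, not a proposal from the document): route such guest accesses through C11 64-bit atomics, which the host, or libatomic, implements with a lock when no native 64-bit access exists, so a concurrent reader can never observe half of an update.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t guest_word;  /* stand-in for guest memory */

/* On a 32-bit host a plain 64-bit store may compile to two 32-bit
 * stores; an atomic store cannot be observed half-done. */
void guest_store64(uint64_t v)
{
    atomic_store_explicit(&guest_word, v, memory_order_relaxed);
}

uint64_t guest_load64(void)
{
    return atomic_load_explicit(&guest_word, memory_order_relaxed);
}
```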
From: Sergey Fedorov