
Re: [Qemu-devel] RFC Multi-threaded TCG design document


From: Frederic Konrad
Subject: Re: [Qemu-devel] RFC Multi-threaded TCG design document
Date: Wed, 17 Jun 2015 23:45:52 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0

On 17/06/2015 20:23, Mark Burton wrote:
On 17 Jun 2015, at 18:57, Dr. David Alan Gilbert <address@hidden> wrote:

* Alex Bennée (address@hidden) wrote:
Hi,
Shared Data Structures
======================

Global TCG State
----------------

We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or jumps from the tb_cache.

DESIGN REQUIREMENT: Add locking around all code generation, patching
and jump cache modification
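As a rough sketch of what this requirement implies (illustrative C only, not QEMU's actual API; `tb_lock`, `tb_count` and `cpu_thread` are made-up names here), every path that needs a new translation holds one global mutex for the whole generate/patch/jump-cache sequence:

```c
#include <pthread.h>

/* All names below are illustrative, not QEMU's real code. */
static pthread_mutex_t tb_lock = PTHREAD_MUTEX_INITIALIZER;
static int tb_count;                 /* stand-in for shared codegen state */

/* Any execution path reaching the run loop that needs a new translation
 * takes the lock around the whole generate/patch/jump-cache sequence. */
static void *cpu_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&tb_lock);
        /* ... emit host code, patch direct jumps, update jump cache ... */
        tb_count++;
        pthread_mutex_unlock(&tb_lock);
    }
    return NULL;
}
```

With four such threads the counter reaches exactly 40000; without the lock the read-modify-write on the shared state races, which is the situation the design requirement rules out.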
I don't think that you require a shared translation buffer between
cores to do this - although it *might* be the easiest way.
You could have a per-core translation buffer, the only requirement is
that most invalidation operations happen on all the buffers
(although that might depend on the emulated architecture).
With a per-core translation buffer, each core could generate new translations
without locking the other cores as long as no one is doing invalidations.
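Dave's alternative could look roughly like this (a hypothetical per-core direct-mapped cache, not QEMU code; `NR_CPUS`, `tb_insert` and `tb_invalidate_pc` are invented for illustration): each core fills only its own buffer lock-free, and the cost moves to invalidation, which must sweep every core's buffer:

```c
/* Hypothetical per-core direct-mapped translation cache; not QEMU code. */
#define NR_CPUS  4
#define CACHE_SZ 256

struct tb_entry {
    unsigned long pc;
    int valid;
};

static struct tb_entry tb_cache[NR_CPUS][CACHE_SZ];

/* Each core writes only its own buffer, so the fill path needs no lock. */
static void tb_insert(int cpu, unsigned long pc)
{
    tb_cache[cpu][pc % CACHE_SZ] = (struct tb_entry){ .pc = pc, .valid = 1 };
}

/* An invalidation (e.g. self-modifying code) must sweep every core's
 * buffer - this is where the cross-core synchronisation cost moves. */
static void tb_invalidate_pc(unsigned long pc)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        struct tb_entry *e = &tb_cache[cpu][pc % CACHE_SZ];
        if (e->valid && e->pc == pc) {
            e->valid = 0;
        }
    }
}
```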
I agree it’s not a design requirement - however we’ve kind of gone round this 
loop in terms of getting things to work.
Fred will doubtless fill in some details, but basically it looks like making 
the TCG able to run several instances in parallel is a nightmare. We seem to get 
reasonable performance having just one CPU at a time generating TBs. At the 
same time, of course, the way QEMU is constructed there are actually several 
‘layers’ of buffer - from the CPU-local ones through to the TB ‘pool’. So, 
actually, by accident or design, we benefit from a sort of caching structure.

True, it seems to be very complex, at least on ARM, because of the disassembly
context etc. On the other hand, invalidation might be easier, I guess.
For performance I'm not sure which is the better way.

Fred
Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system.

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, migration and display)
  - Virtual TLB (for translating guest address->real address)

There is both a fast path walked by the generated code and a slow
path when resolution is required. When the TLB tables are updated we
need to ensure they are done in a safe way by bringing all executing
threads to a halt before making the modifications.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be cross-CPU
    - will need all other CPUs brought to a halt
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-CPU table - by definition it can't race
    - updated by its own thread when the slow path is forced
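The "bring all CPUs to a halt before flushing" requirement can be sketched as a request/ack rendezvous (illustrative names only; QEMU's actual mechanism differs): each vCPU polls a flag between blocks and parks itself, and the flusher waits for full quiescence before touching any table:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative names only; not QEMU's real implementation. */
#define NR_CPUS 4
#define TLB_SZ  64

static unsigned long tlb[NR_CPUS][TLB_SZ];  /* per-CPU virtual TLB */
static atomic_int flush_requested;
static atomic_int cpus_parked;
static atomic_int run_done;                 /* lets vCPU threads exit */

/* Each vCPU polls this between translation blocks. */
static void cpu_check_halt(void)
{
    if (atomic_load(&flush_requested)) {
        atomic_fetch_add(&cpus_parked, 1);
        while (atomic_load(&flush_requested)) {
            /* parked: spin until the flush is complete */
        }
        atomic_fetch_sub(&cpus_parked, 1);
    }
}

/* Stand-in for a vCPU execution loop. */
static void *vcpu_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&run_done)) {
        cpu_check_halt();
    }
    return NULL;
}

/* Whoever needs the cross-CPU flush first brings everyone to a halt. */
static void tlb_flush_all(void)
{
    atomic_store(&flush_requested, 1);
    while (atomic_load(&cpus_parked) != NR_CPUS) {
        /* wait for all vCPUs to quiesce */
    }
    for (int c = 0; c < NR_CPUS; c++) {
        for (int i = 0; i < TLB_SZ; i++) {
            tlb[c][i] = 0;   /* safe: no vCPU is executing */
        }
    }
    atomic_store(&flush_requested, 0);
}
```

A real implementation would sleep rather than spin, but the ordering is the point: the modification only happens once every executing thread is known to be out of the fast path.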

Emulated hardware state
-----------------------

Currently the hardware emulation has no protection against
multiple-accesses. However guest systems accessing emulated hardware
should be carrying out their own locking to prevent multiple CPUs
confusing the hardware. Of course, there is no guarantee that there
couldn't be a broken guest that doesn't lock, so you could get racing
accesses to the hardware.

There is the class of paravirtualized hardware (VIRTIO) that works in
a purely MMIO mode, often setting flags directly in guest memory as a
result of a guest-triggered transaction.

DESIGN REQUIREMENTS:

  - Access to IO Memory should be serialised by an IOMem mutex
  - The mutex should be recursive (e.g. allowing the same thread to re-lock it)

IO Subsystem
------------

The I/O subsystem is heavily used by KVM and has seen a lot of
improvements to offload I/O tasks to dedicated IOThreads. There should
be no additional locking required once we reach the Block Driver.

DESIGN REQUIREMENTS:

  - The dataplane should continue to be protected by the iothread locks
Watch out for where DMA invalidates the translated code.


need to check - that might be a great catch !

Cheers

Mark.

Dave





--
Alex Bennée
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK
