[RFC 4/9] hw/misc/memexpose: Add documentation
Tue, 4 Feb 2020 12:30:46 +0100
From: Igor Kotrasinski <address@hidden>
Signed-off-by: Igor Kotrasinski <address@hidden>
docs/specs/memexpose-spec.txt | 168 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 168 insertions(+)
create mode 100644 docs/specs/memexpose-spec.txt
diff --git a/docs/specs/memexpose-spec.txt b/docs/specs/memexpose-spec.txt
new file mode 100644
@@ -0,0 +1,168 @@
+= Specification for Inter-VM memory region sharing device =
+The inter-VM memory region sharing device (memexpose) is designed to allow two
+QEMU devices to share arbitrary physical memory regions between one another, as
+well as pass simple interrupts. It attempts to share memory regions directly
+when feasible, falling back to MMIO via socket communication when it's not.
+The device is modeled by QEMU as a PCI device, as well as a memory
+region/interrupt directly usable on platforms like ARM, with an entry in the
+device tree.
+An example use case for memexpose is forwarding ARM Trustzone functionality
+between two VMs running different architectures - one running a rich OS on an
+x86_64 VM, the other running the trusted OS on an ARM VM. In this scenario,
+sharing arbitrary memory regions allows such forwarding to work with minimal
+changes to the trusted OS.
+== Configuring the memexpose device ==
+The device uses two character devices to communicate with the other VM - one
+for synchronous memory accesses, another for passing interrupts. A typical
+configuration of the PCI device looks like this:
+ -chardev socket,...,path=/tmp/qemu-memexpose-mem,id="mem" \
+ -chardev socket,...,path=/tmp/qemu-memexpose-intr,id="intr" \
+While the arm-virt machine device can be enabled like this:
+ -chardev socket,...,path=/tmp/qemu-memexpose-mem,id="mem-mem" \
+ -chardev socket,...,path=/tmp/qemu-memexpose-intr,id="mem-intr" \
+ -machine memexpose-ep=mem,memexpose-size=0xN...
+Normally one of the VMs would have 'server,nowait' options set on these
+chardevs.
+At the moment the memory exposed to the other device always starts at 0
+(relative to system_memory). The shm_size/memexpose-size property indicates the
+size of the exposed region.
+The *_chardev/memexpose-ep properties are used to point the memexpose device to
+chardevs used to communicate with the other VM.
+== Memexpose PCI device interface ==
+The device has vendor ID 1af4, device ID 1111, revision 0.
+=== PCI BARs ===
+The device has two BARs:
+- BAR0 holds device registers and interrupt data (0x1000 byte MMIO),
+- BAR1 maps memory from the other VM.
+To use the device, you must first enable it by writing 1 to BAR0 at address 0.
+This makes QEMU wait for another VM to connect. Once that is done, you can
+access the other machine's memory via BAR1.
+Interrupts can be sent and received by configuring the device for interrupts
+and reading and writing to registers in BAR0.
+=== Device registers ===
+BAR0 has the following registers:
+    Offset  Size  Access      On reset  Function
+    0       8     read/write  0         Enable/disable device
+                                        bit 0: device enabled / disabled
+                                        bit 1..63: reserved
+    0x400   8     read/write  0         Interrupt RX address
+                                        bit 0: interrupt available
+                                        bit 1..63: reserved
+    0x408   8     read-only   UD        RX Interrupt type
+    0x410   128   read-only   UD        RX Interrupt data
+    0x800   8     read/write  0         Interrupt TX address
+    0x808   8     write-only  N/A       TX Interrupt type
+    0x810   128   write-only  N/A       TX Interrupt data
+(UD: undefined.)
+All other addresses are reserved.
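The register layout above could be captured in a guest-side header along these lines. This is only a sketch: the macro and function names are hypothetical, and it assumes BAR0 has already been mapped by the guest and that 64-bit accesses to the registers are acceptable, which the spec does not state explicitly.

```c
#include <stdint.h>

/* BAR0 register offsets, copied from the table in the spec. */
#define MEMEXPOSE_REG_ENABLE     0x000u  /* bit 0: device enabled */
#define MEMEXPOSE_REG_RX_ADDR    0x400u  /* interrupt RX address */
#define MEMEXPOSE_REG_RX_TYPE    0x408u  /* RX interrupt type */
#define MEMEXPOSE_REG_RX_DATA    0x410u  /* RX interrupt data, 128 bytes */
#define MEMEXPOSE_REG_TX_ADDR    0x800u  /* interrupt TX address */
#define MEMEXPOSE_REG_TX_TYPE    0x808u  /* TX interrupt type */
#define MEMEXPOSE_REG_TX_DATA    0x810u  /* TX interrupt data, 128 bytes */
#define MEMEXPOSE_INTR_DATA_SIZE 128u
#define MEMEXPOSE_BAR0_SIZE      0x1000u

/* Hypothetical accessors; bar0 must point at a mapped, 8-byte aligned BAR0. */
static inline uint64_t memexpose_read64(volatile uint8_t *bar0, uint32_t off)
{
    return *(volatile uint64_t *)(bar0 + off);
}

static inline void memexpose_write64(volatile uint8_t *bar0, uint32_t off,
                                     uint64_t val)
{
    *(volatile uint64_t *)(bar0 + off) = val;
}

/* Enable the device by writing 1 to BAR0 at address 0. */
static inline void memexpose_enable(volatile uint8_t *bar0)
{
    memexpose_write64(bar0, MEMEXPOSE_REG_ENABLE, 1);
}
```

After memexpose_enable() QEMU waits for the other VM to connect, so a real driver would only touch BAR1 once that has happened.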
+=== Handling interrupts ===
+To send interrupts, write to the TX interrupt address. The contents of the TX
+interrupt type and data registers will be sent along with the interrupt. The
+device holds an internal queue of 16 interrupts; any extra interrupts are
+silently dropped.
+To receive interrupts, read the interrupt RX address. If the value is 1, then
+RX interrupt type and data registers contain the data / type sent by the other
+VM. Otherwise (the value is 0), no more interrupts are queued and RX interrupt
+type/data register contents are undefined.
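The send and receive steps described above might look roughly like this from the guest side. It is a sketch under several assumptions that are not in the spec: the names are hypothetical, the value written to the TX address register (1) is a guess since the spec only says to write to it, and the data registers are assumed to tolerate byte-wide accesses.

```c
#include <stdbool.h>
#include <stdint.h>

#define REG_RX_ADDR    0x400u
#define REG_RX_TYPE    0x408u
#define REG_RX_DATA    0x410u
#define REG_TX_ADDR    0x800u
#define REG_TX_TYPE    0x808u
#define REG_TX_DATA    0x810u
#define INTR_DATA_SIZE 128u

/* Hypothetical in-memory form of one interrupt. */
struct memexpose_intr {
    uint64_t type;
    uint8_t data[INTR_DATA_SIZE];
};

static uint64_t reg_read64(volatile uint8_t *bar0, uint32_t off)
{
    return *(volatile uint64_t *)(bar0 + off);
}

static void reg_write64(volatile uint8_t *bar0, uint32_t off, uint64_t val)
{
    *(volatile uint64_t *)(bar0 + off) = val;
}

/* Fill in TX type and data first, then trigger the send. */
static void memexpose_send_intr(volatile uint8_t *bar0,
                                const struct memexpose_intr *in)
{
    reg_write64(bar0, REG_TX_TYPE, in->type);
    for (uint32_t i = 0; i < INTR_DATA_SIZE; i++) {
        bar0[REG_TX_DATA + i] = in->data[i];
    }
    /* Assumption: any write triggers the send; 1 is an arbitrary value. */
    reg_write64(bar0, REG_TX_ADDR, 1);
}

/* Returns true and fills *out if an interrupt was queued, false otherwise. */
static bool memexpose_try_recv_intr(volatile uint8_t *bar0,
                                    struct memexpose_intr *out)
{
    if (reg_read64(bar0, REG_RX_ADDR) != 1) {
        return false; /* queue empty: RX type/data are undefined */
    }
    out->type = reg_read64(bar0, REG_RX_TYPE);
    for (uint32_t i = 0; i < INTR_DATA_SIZE; i++) {
        out->data[i] = bar0[REG_RX_DATA + i];
    }
    return true;
}
```

An interrupt handler would typically call memexpose_try_recv_intr() in a loop until it returns false, since up to 16 interrupts can be queued.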
+=== Platform device protocol ===
+The other memexpose device type (provided on e.g. ARM via device tree) is
+essentially identical to the PCI device. It provides two memory ranges that
+behave exactly like the PCI BAR regions and an interrupt for signaling an
+interrupt from the other VM.
+== Memexpose peer protocol ==
+This section describes the current memexpose protocol. It is a WIP and likely
+to change.
+A connection between two VMs connected via memexpose happens on two sockets -
+an interrupt socket and a memory socket. All communication on the former is
+asynchronous, while communication on the latter is synchronous.
+When the device is enabled, QEMU waits for memexpose's chardevs to connect. No
+messages are exchanged upon connection. After devices are connected, the
+following messages can be exchanged:
+1. Interrupt message, via interrupt socket. This message contains interrupt
+ type and data.
+2. Memory access request message, via memory socket. It contains a target
+ address, access size and value to write in case of writes.
+3. Memory access return message. This contains an access result (as
+ MemTxResult) and a value in case of reads. If the accessed region can be
+ shared directly, then this region's start, size and shmem file descriptor
+ are also sent.
+4. Memory invalidation message. This is sent when a VM's memory region changes
+ status and contains such region's start and size. The other VM is expected
+ to drop any shared regions overlapping with it.
+5. Memory invalidation response. This is sent in response to a memory
+ invalidation message; after receiving this the remote VM is guaranteed to
+ have scheduled region invalidation before accessing the region again.
+Because QEMU performs memory accesses synchronously, memory invalidation must
+complete before returning to the guest OS, and both VMs might attempt a remote
+memory access at the same time, all messages passed via the memory socket
+have an associated priority.
+At any time, only one message with a given priority is in flight. After sending
+a message, the VM reads messages on the memory socket, servicing all messages
+with a priority higher than its own. Once it receives a message with a priority
+lower than its own, it waits for a response to its own message before servicing
+it. This guarantees no deadlocks, assuming that messages don't trigger further
+messages. Message priorities, from highest to lowest, are as follows:
+1. Memory invalidation message/response.
+2. Memory access message/response.
+Additionally, one of the VMs is assigned a sub-priority higher than another, so
+that its messages of the same type have priority over the other VM's messages.
+Memory access messages have the lowest priority in order to guarantee that QEMU
+will not attempt to access memory while in the middle of a memory region
+invalidation.
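One way to picture the priority rule is as a single integer rank per message, combining the message class with the per-VM sub-priority. The sketch below is illustrative only: the enum, the function names and the rank encoding are assumptions, not taken from the patch.

```c
#include <stdbool.h>

/* Message classes from the list in the spec, lowest priority first. */
enum memexpose_msg_class {
    MEMEXPOSE_MSG_ACCESS = 0,     /* memory access message/response */
    MEMEXPOSE_MSG_INVALIDATE = 1, /* memory invalidation message/response */
};

/*
 * Hypothetical rank: a larger value means higher priority. The class
 * dominates; the sub-priority of the sending VM only breaks ties between
 * messages of the same class.
 */
static int memexpose_rank(enum memexpose_msg_class cls, bool sender_preferred)
{
    return (int)cls * 2 + (sender_preferred ? 1 : 0);
}

/*
 * While waiting for the reply to its own in-flight message, a VM services
 * an incoming message immediately only if it outranks its own message;
 * otherwise the incoming message is deferred until the reply arrives.
 */
static bool memexpose_service_now(int own_rank, int incoming_rank)
{
    return incoming_rank > own_rank;
}
```

Because the two VMs have different sub-priorities, two messages from different VMs never have equal rank, so exactly one side defers in any collision.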
+=== Memexpose memory sharing ===
+This section describes the memexpose memory sharing mechanism.
+Memory sharing is implemented lazily; initially no memory regions are shared
+between devices. When a memory access is performed via a socket, the remote VM
+checks whether the underlying memory range is backed by shareable memory. If it
+is, the VM finds out the maximum contiguous flat range backed by this region
+and sends its file descriptor to the local VM, where it is mapped as a subregion.
+The memexpose device registers memory listeners for the memory region it's
+using. Whenever a flat range for this region (that is not this device's
+subregion) changes, that range is sent to the other VM and any directly shared
+memory region intersecting this range is scheduled for removal via a BH.
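The check for "any directly shared memory region intersecting this range" comes down to an interval overlap test. A minimal sketch, with hypothetical type and function names and a plain array standing in for whatever bookkeeping the device actually uses:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a range: first byte and length in bytes. */
struct mem_range {
    uint64_t start;
    uint64_t size;
};

/* True if the two ranges share at least one byte. Empty ranges never do. */
static bool ranges_intersect(struct mem_range a, struct mem_range b)
{
    if (a.size == 0 || b.size == 0) {
        return false;
    }
    /* Compare against the last byte so start + size cannot wrap to 0. */
    uint64_t a_last = a.start + (a.size - 1);
    uint64_t b_last = b.start + (b.size - 1);
    return a.start <= b_last && b.start <= a_last;
}

/*
 * Flag every shared region that intersects the invalidated range and
 * return how many were flagged. Actual removal would happen later, e.g.
 * from a bottom half as the spec describes.
 */
static size_t mark_invalidated(const struct mem_range *shared, bool *drop,
                               size_t n, struct mem_range inval)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        drop[i] = ranges_intersect(shared[i], inval);
        if (drop[i]) {
            count++;
        }
    }
    return count;
}
```

Ranges that merely touch (one ends where the next begins) are not treated as intersecting, so an invalidation does not drop an adjacent shared region.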