[Qemu-devel] device assignment for embedded Power
From: Yoder Stuart-B08248
Subject: [Qemu-devel] device assignment for embedded Power
Date: Thu, 30 Jun 2011 15:59:55 +0000
One feature we need for QEMU/KVM on embedded Power Architecture is the
ability to do passthru assignment of SoC I/O devices and memory. An
important use case in embedded is creating static partitions--
taking physical memory and I/O devices (non-PCI) and partitioning
them between the host Linux and several virtual machines. Things like
live migration would not be needed or supported in these types of scenarios.
SoC devices do not sit on a probeable bus and there are no identifiers
like 01:00.0 with PCI that we can use to identify devices-- the host
Linux kernel is made aware of SoC I/O devices from nodes/properties in a
device tree structure passed at boot. QEMU needs to generate a
device tree to pass to the guest as well with all the guest's virtual
and physical resources. Today a number of mostly complete guest device
trees are kept under ./pc-bios in QEMU, but this is too static and
inflexible.
Some new mechanism is needed to assign SoC devices to guests, and we
(FSL + Alex Graf) have been discussing a few possible approaches
for doing this from QEMU and would like some feedback.
Some possibilities:
1. Option 1. Pass the host dev tree to QEMU and assign devices
by device tree path
-dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/address@hidden
/soc/address@hidden is the device tree path to the assigned device.
The device node 'address@hidden' has some number of properties (e.g.
address, interrupt info) and possibly subnodes under
it. QEMU copies that node when generating the guest dev tree.
See snippet of entire node: http://paste2.org/p/1496460
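As a rough illustration of option 1, the sketch below models a device tree
as nested Python dicts and copies one node, with its properties and
subnodes, from a host tree into a guest tree by path. The dict model, the
helper names (find_node, copy_node) and the path /soc/dev@0 are assumptions
made up for this example, not QEMU code; a real implementation would
presumably operate on the DTB itself.

```python
import copy

# Assumed model: a node is {"props": {name: value}, "children": {name: node}}.

def find_node(tree, path):
    """Return the node at an absolute path such as '/soc/dev@0'."""
    node = tree
    for name in [p for p in path.split("/") if p]:
        node = node["children"][name]  # KeyError if the path does not exist
    return node

def copy_node(host, guest, path):
    """Deep-copy the host node at `path` into the guest tree.

    Missing parent nodes are created empty in the guest tree.
    """
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("refusing to copy the root node")
    src = find_node(host, path)
    dst = guest
    for name in parts[:-1]:
        dst = dst["children"].setdefault(name, {"props": {}, "children": {}})
    dst["children"][parts[-1]] = copy.deepcopy(src)
    return dst["children"][parts[-1]]

# Hypothetical host tree with one SoC device.
host_tree = {"props": {}, "children": {
    "soc": {"props": {}, "children": {
        "dev@0": {"props": {"compatible": "fsl-i2c",
                            "reg": [0xffe03000, 0x100]},
                  "children": {}}}}}}
guest_tree = {"props": {}, "children": {}}
copy_node(host_tree, guest_tree, "/soc/dev@0")
```

The deep copy matters: later per-guest modifications to the node must not
leak back into the host's view of the tree.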
2. Option 2. Pass the entire assigned device node as a string to
QEMU
-device assigned-soc-dev,dev=/address@hidden,dev-node='#address-cells =
<1>;
#size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
reg = <0xffe03000 0x100>; interrupts = <43 2>;
interrupt-parent = <&mpic>; dfsrr;'
This avoids needing to pass the host device tree, but could
get awkward-- the i2c example above is very simple, some device
nodes are very large with a complex hierarchy of subnodes and
could be hundreds of lines of text to represent a single
node.
It gets more complicated...
In some cases, modifications to device tree nodes may be needed.
An example-- sometimes a device tree property references another node
and that relationship may not exist when assigned to a guest.
A "phy-handle" property may need to be deleted and a "fixed-link"
property added to a node representing a network device.
So in addition to assigning a device, a mechanism is needed to update
device tree nodes. So for the above example, maybe--
-device assigned-soc-dev,dev=/soc/address@hidden,delete-prop=phy-handle,
node-update="fixed-link = <2 1 1000 0 0>"
The types of modifications needed include deleting nodes, deleting
properties, adding nodes, adding properties, adding properties that
reference other nodes, and changing properties. This device tree
transformation mechanism is general enough that it could apply to any
device tree based embedded platform (e.g. ARM, MIPS).
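A minimal sketch of the property-level part of such a transformation,
assuming a node's properties are held in a plain Python dict. The function
name apply_updates and the "example,ethernet" compatible string are
hypothetical; the phy-handle and fixed-link values mirror the example
above.

```python
def apply_updates(props, delete_props=(), set_props=None):
    """Return a copy of `props` with some properties removed and others set.

    The input dict is left untouched so the host's node is not modified.
    """
    result = dict(props)
    for name in delete_props:
        result.pop(name, None)  # silently ignore properties that are absent
    result.update(set_props or {})
    return result

# Hypothetical network device node as seen by the host.
eth_host = {"compatible": "example,ethernet", "phy-handle": "&phy0"}

# Guest view: drop the PHY reference, describe a fixed link instead.
eth_guest = apply_updates(
    eth_host,
    delete_props=["phy-handle"],
    set_props={"fixed-link": [2, 1, 1000, 0, 0]},
)
```

Node-level operations (adding or deleting whole subnodes, fixing up
references between nodes) would need more than this, which is where the
complexity discussed further down comes from.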
Another complexity relates to the IOMMU. Here things get very company
and IOMMU specific. Freescale has a proprietary IOMMU.
Devices have 1 or more logical I/O device numbers used to index into
the IOMMU table. The IOMMU is limited in that it is designed to only
support large, physically contiguous mappings per device. It does not
support any kind of page table. The IOMMU hardware architecture
assumes DMAs are typically targeted to just a few address regions.
So, a common IOMMU setup for a device would be a device with a single
IOMMU mapping covering the guest's main memory segment. However,
there are many much more complicated IOMMU setups that are common as
well, such as doing "operation translations" where a device's write
transaction is translated to "stash" directly into CPU caches. We
can't assume that all memory slots belonging to the guest are targets
of DMA.
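To make the "table indexed by logical I/O device number, one large
contiguous window per entry" idea concrete, here is a small model of it.
This is only a sketch of the concept as described above, not of the actual
hardware; the class and function names and the LIODN value 5 are invented
for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DmaWindow:
    guest_addr: int  # start of the window in guest physical address space
    size: int        # length in bytes; one contiguous region, no page table

    def contains(self, addr, length=1):
        """True if [addr, addr + length) lies entirely inside the window."""
        return (self.guest_addr <= addr
                and addr + length <= self.guest_addr + self.size)

# Assumed table: LIODN -> window. Here one device gets a single mapping
# covering a 512MB guest memory segment starting at 0.
iommu_table = {5: DmaWindow(guest_addr=0x0, size=0x2000_0000)}

def dma_allowed(table, liodn, addr, length):
    """Check a DMA access against the window configured for a LIODN."""
    window = table.get(liodn)
    return window is not None and window.contains(addr, length)
```

A device with no table entry, or an access that straddles the end of its
window, is rejected in this model.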
So for Freescale we would need some very Freescale-specific
configuration mechanism to set up the IOMMU. Here I think we would
need the new qcfg approach to expressing nested
structures (http://wiki.qemu.org/Features/QCFG). Device
assignment with IOMMU set up might look like the examples
below:
# device with multiple logical i/o device numbers
-device assigned-soc-dev,dev=/qman-portals/address@hidden,
vcpu=1,fsl,iommu.stash-mem={
dma-window.guest-addr=0x0,
dma-window.size=0x100000000,
liodn-index=1,
operation-mapping=0,
stash-dest=1},
fsl,iommu.stash-dqrr={
dma-window.guest-addr=0xff4200000,
dma-window.size=0x4000,
liodn-index=0,
operation-mapping=0,
stash-dest=1}
# assign pci-bus to a guest with multiple memory regions
# addr size
# 0x0 512MB
# 0x20000000 4KB (for MSIs)
# 0x40000000 16MB (shared memory)
# 0xc0000000 64MB (shared memory)
-device assigned-soc-dev,dev=/address@hidden,
fsl,iommu={dma-window.guest-addr=0x0,
dma-window.size=0x100000000,
dma-window.subwindow-count=8,
dma-window.sub-window.0.guest-addr=0x0,
dma-window.sub-window.0.size=0x20000000,
dma-window.sub-window.1.guest-addr=0x20000000,
dma-window.sub-window.1.size=0x4000,
dma-window.sub-window.1.pci-msi-subwindow,
dma-window.sub-window.2.guest-addr=0x40000000,
dma-window.sub-window.2.size=0x01000000,
dma-window.sub-window.3.guest-addr=0xc0000000,
dma-window.sub-window.3.size=0x04000000}
The above are from some real examples based on the SoC device
assignment mechanisms in the Freescale Embedded Hypervisor.
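As a sketch of how dotted option names like these could be turned into a
nested structure and sanity-checked, consider the following. The syntax
handled here (comma-separated key=value pairs, bare keys as boolean flags)
is an assumption for illustration and is not the QCFG grammar; the helper
names are hypothetical.

```python
def parse_nested(text):
    """Turn 'a.b=1,a.c=0x2,flag' into {'a': {'b': 1, 'c': 2}, 'flag': True}."""
    root = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        parts = key.strip().split(".")
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # Values are parsed as integers (decimal or 0x hex); bare keys are flags.
        node[parts[-1]] = int(value, 0) if sep else True
    return root

def check_subwindows(window):
    """Return a list of problems found in a window's sub-windows."""
    base = window["guest-addr"]
    end = base + window["size"]
    subs = sorted(window.get("sub-window", {}).values(),
                  key=lambda s: s["guest-addr"])
    problems, prev_end = [], 0
    for sub in subs:
        start, stop = sub["guest-addr"], sub["guest-addr"] + sub["size"]
        if start < base or stop > end:
            problems.append(f"sub-window at {start:#x} is outside the window")
        if start < prev_end:
            problems.append(f"sub-window at {start:#x} overlaps its predecessor")
        prev_end = max(prev_end, stop)
    return problems

cfg = parse_nested(
    "dma-window.guest-addr=0x0,dma-window.size=0x100000000,"
    "dma-window.sub-window.0.guest-addr=0x0,"
    "dma-window.sub-window.0.size=0x20000000,"
    "dma-window.sub-window.1.guest-addr=0x20000000,"
    "dma-window.sub-window.1.size=0x4000,"
    "dma-window.sub-window.1.pci-msi-subwindow"
)
problems = check_subwindows(cfg["dma-window"])
```

Only containment and overlap are checked; any hardware-specific
constraints on sub-window sizes or counts are deliberately left out.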
A final thing...
Both options 1 and 2 above introduce an implementation complexity--
both need to be able to parse the text device tree source (DTS) format:
option 2 because the entire node is passed as text, and both options
in order to do complex node updates. QEMU would need to do syntactic and
semantic parsing of DTS syntax, basically needing parts of the front end
of dtc (the device tree compiler-- http://git.jdl.com/gitweb/).
Option 3. So a 3rd approach could be an extension of options 1
or 2. Instead of expressing nodes in ascii DTS format requiring
parsing, pass a compiled file in device tree binary format to QEMU
that expresses the Qdev properties.
So instead of:
-device assigned-soc-dev,dev=/soc/address@hidden,delete-prop=phy-handle,
node-update="fixed-link = <2 1 1000 0 0>"
You might have a config file containing:
ethernet0 {
compatible = "device";
type = "assigned-soc-dev";
dev = "/soc/address@hidden";
node-update {
delete-prop="phy-handle";
fixed-link = <2 1 1000 0 0>;
};
};
You would compile the file into a DTB and then pass it to QEMU:
-config-dtb ./myguest.dtb
The above is a very simple example-- the benefit of this approach is
in the much more complicated node updates that are sometimes needed.
The config-dtb is just an alternate way of getting complex
device tree data into QEMU. It supplements and does not change
existing QEMU architecture.
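To give a feel for why consuming a compiled DTB is simpler than parsing
DTS text, here is a minimal sketch that walks the structure block of a
flattened device tree and returns nested dicts. It is written in Python
purely for illustration and is not libfdt or QEMU code; the function names
are made up, it does no bounds or version checking, and property values
are returned as raw bytes.

```python
import struct

FDT_MAGIC = 0xD00DFEED
FDT_BEGIN_NODE, FDT_END_NODE, FDT_PROP, FDT_NOP, FDT_END = 1, 2, 3, 4, 9

def _align4(offset):
    return (offset + 3) & ~3

def _cstring(blob, offset):
    """Read a NUL-terminated string; return it and the offset after the NUL."""
    end = blob.index(b"\0", offset)
    return blob[offset:end].decode("ascii"), end + 1

def parse_dtb(blob):
    """Parse a DTB into {'props': {name: bytes}, 'children': {name: node}}."""
    # Header fields are big-endian 32-bit words; only the first four are used.
    magic, _totalsize, off_struct, off_strings = struct.unpack_from(">4I", blob, 0)
    if magic != FDT_MAGIC:
        raise ValueError("not a flattened device tree")
    pos, stack, root = off_struct, [], None
    while True:
        (token,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        if token == FDT_BEGIN_NODE:
            name, after = _cstring(blob, pos)
            pos = _align4(after)
            node = {"props": {}, "children": {}}
            if stack:
                stack[-1]["children"][name] = node
            else:
                root = node
            stack.append(node)
        elif token == FDT_END_NODE:
            stack.pop()
        elif token == FDT_PROP:
            length, nameoff = struct.unpack_from(">2I", blob, pos)
            pos += 8
            prop_name, _ = _cstring(blob, off_strings + nameoff)
            stack[-1]["props"][prop_name] = bytes(blob[pos:pos + length])
            pos = _align4(pos + length)
        elif token == FDT_NOP:
            continue
        elif token == FDT_END:
            return root
        else:
            raise ValueError(f"unexpected token {token:#x}")
```

With the tree in this form, a config node such as ethernet0 above is just
a dict lookup away, and no DTS grammar is involved.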
Some pluses of this approach:
-avoids pulling in substantial complexity for parsing DTS
syntax
-device tree nodes are represented in their "native" DTB
format
-an available user space library (libfdt) is already part
of QEMU for parsing DTBs
-greatly simplifies handling node updates where nodes reference other
nodes
-could use either option 1 (assign node by reference) or option 2
(assign node by
-we've used an approach similar to this in the Freescale Embedded
Hypervisor for 3+ years now and it's held up well
Regards,
Stuart Yoder