
[Qemu-devel] device assignment for embedded Power


From: Yoder Stuart-B08248
Subject: [Qemu-devel] device assignment for embedded Power
Date: Thu, 30 Jun 2011 15:59:55 +0000

One feature we need for QEMU/KVM on embedded Power Architecture is the 
ability to do passthru assignment of SoC I/O devices and memory.  An 
important use case in embedded is creating static partitions-- 
taking physical memory and I/O devices (non-PCI) and partitioning
them between the host Linux and several virtual machines.   Things like
live migration would not be needed or supported in these types of scenarios.

SoC devices do not sit on a probeable bus and there are no identifiers 
like 01:00.0 with PCI that we can use to identify devices--  the host
Linux kernel is made aware of SoC I/O devices from nodes/properties in a 
device tree structure passed at boot.   QEMU needs to generate a
device tree to pass to the guest as well with all the guest's virtual
and physical resources.  Today a number of mostly complete guest device
trees are kept under ./pc-bios in QEMU, but this is too static and
inflexible.

Some new mechanism is needed to assign SoC devices to guests, and we
(FSL + Alex Graf) have been discussing a few possible approaches
for doing this from QEMU and would like some feedback.

Some possibilities:

1. Option 1.  Pass the host dev tree to QEMU and assign devices
   by device tree path

     -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/address@hidden

   /soc/address@hidden is the device tree path to the assigned device.
   The device node 'address@hidden' has some number of properties (e.g. 
   address, interrupt info) and possibly subnodes under
   it.   QEMU copies that node when generating the guest dev tree.
   See snippet of entire node:  http://paste2.org/p/1496460

2. Option 2.  Pass the entire assigned device node as a string to
   QEMU

     -device assigned-soc-dev,dev=/address@hidden,dev-node='#address-cells = <1>;
      #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
      reg = <0xffe03000 0x100>; interrupts = <43 2>;
      interrupt-parent = <&mpic>; dfsrr;'

   This avoids needing to pass the host device tree, but could
   get awkward.  The i2c example above is very simple; some device
   nodes are very large, with a complex hierarchy of subnodes, and
   could take hundreds of lines of text to represent a single
   node.
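
To make option 1 concrete, here is a minimal Python sketch of "look up
a node by path in the host tree and copy it into the guest tree".  It is
only an illustration, not how QEMU would implement it: the nested-dict
tree representation, the helper names (make_node, find_node,
assign_by_path) and the node name "i2c" are all assumptions made up for
the example.  The property values are borrowed from the i2c node shown
in option 2.

```python
import copy

# Hypothetical in-memory device tree: each node is a dict holding
# "props" (property name -> value) and "children" (node name -> node).
def make_node(props=None, children=None):
    return {"props": dict(props or {}), "children": dict(children or {})}

def find_node(root, path):
    """Return the node at an absolute path such as '/soc/i2c', or None."""
    node = root
    for name in [p for p in path.split("/") if p]:
        node = node["children"].get(name)
        if node is None:
            return None
    return node

def assign_by_path(host_root, guest_root, path):
    """Deep-copy the host node at `path` (with its subnodes) into the
    guest tree, creating empty parent nodes as needed."""
    src = find_node(host_root, path)
    if src is None:
        raise KeyError("no such node: " + path)
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("cannot assign the root node")
    parent = guest_root
    for name in parts[:-1]:
        parent = parent["children"].setdefault(name, make_node())
    parent["children"][parts[-1]] = copy.deepcopy(src)
    return parent["children"][parts[-1]]

# Hypothetical host tree with a single SoC device.
host = make_node(children={
    "soc": make_node(children={
        "i2c": make_node(props={"compatible": "fsl-i2c",
                                "reg": [0xffe03000, 0x100],
                                "interrupts": [43, 2]}),
    }),
})
guest = make_node()
assigned = assign_by_path(host, guest, "/soc/i2c")
```

The copy is deliberately a deep copy, so later edits to the guest node
never touch the host tree.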

It gets more complicated...

In some cases, modifications to device tree nodes may be needed.
For example, a device tree property sometimes references another node,
and that relationship may not exist once the device is assigned to a guest.
A "phy-handle" property may need to be deleted and a "fixed-link"
property added to a node representing a network device.

So in addition to assigning a device, a mechanism is needed to update 
device tree nodes.  So for the above example, maybe--

 -device assigned-soc-dev,dev=/soc/address@hidden,delete-prop=phy-handle,
  node-update="fixed-link = <2 1 1000 0 0>"

The types of modifications needed are: deleting nodes, deleting properties,
adding nodes, adding properties, adding properties that reference other
nodes, and changing properties.  The device tree transformation mechanism
needed is general enough that it could apply to any device tree based
embedded platform (e.g. ARM, MIPS).
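
As a rough illustration of the property-level updates described above,
the Python sketch below deletes "phy-handle" and adds "fixed-link" on a
copy of a node.  The function name apply_updates, the dict
representation of a node, and the sample host node (including its
"compatible" string and phandle value) are hypothetical; only the
"fixed-link" cells come from the example command line.

```python
def apply_updates(node, delete_props=(), set_props=None):
    """Return an updated copy of `node`, a dict of property name -> value.

    Properties listed in `delete_props` are removed first, then the
    properties in `set_props` are added or overwritten.
    """
    updated = dict(node)
    for name in delete_props:
        updated.pop(name, None)  # deleting an absent property is a no-op
    updated.update(set_props or {})
    return updated

# Hypothetical host node for a network device; the values are made up.
enet = {"compatible": "example,ethernet", "phy-handle": 7}

guest_enet = apply_updates(
    enet,
    delete_props=["phy-handle"],
    set_props={"fixed-link": [2, 1, 1000, 0, 0]},
)
```

Node-level operations (adding or deleting whole nodes) and properties
that reference other nodes are not covered by this sketch.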

Another complexity relates to the IOMMU.  Here things get very company 
and IOMMU specific. Freescale has a proprietary IOMMU.
Devices have 1 or more logical I/O device numbers used to index into 
the IOMMU table. The IOMMU is limited in that it is designed to only 
support large, physically contiguous mappings per device.  It does not 
support any kind of page table.  The IOMMU hardware architecture 
assumes DMAs are typically targeted to just a few address regions.  
So, a common IOMMU setup for a device would be a device with a single 
IOMMU mapping covering the guest's main memory segment.  However, 
there are many much more complicated IOMMU setups that are common as 
well, such as doing "operation translations" where a device's write 
transaction is translated to "stash" directly into CPU caches.  We 
can't assume that all memory slots belonging to the guest are targets 
of DMA.

So for Freescale we would need some very Freescale-specific 
configuration mechanism to set up the IOMMU.  Here I think we would 
need the new qcfg approach to expressing nested
structures (http://wiki.qemu.org/Features/QCFG).   Device
assignment with IOMMU setup might look like the examples
below:

# device with multiple logical i/o device numbers

-device assigned-soc-dev,dev=/qman-portals/address@hidden,
vcpu=1,fsl,iommu.stash-mem={
dma-window.guest-addr=0x0,
dma-window.size=0x100000000,
liodn-index=1,
operation-mapping=0,
stash-dest=1},
fsl,iommu.stash-dqrr={
dma-window.guest-addr=0xff4200000,
dma-window.size=0x4000,
liodn-index=0,
operation-mapping=0,
stash-dest=1}

# assign pci-bus to a guest with multiple memory regions
#    addr       size
#    0x0         512MB
#    0x20000000  4KB  (for MSIs)
#    0x40000000  16MB (shared memory)
#    0xc0000000  64MB (shared memory)

-device assigned-soc-dev,dev=/address@hidden,
fsl,iommu={dma-window.guest-addr=0x0,
dma-window.size=0x100000000,
dma-window.subwindow-count=8,
dma-window.sub-window.0.guest-addr=0x0,
dma-window.sub-window.0.size=0x20000000,
dma-window.sub-window.1.guest-addr=0x20000000,
dma-window.sub-window.1.size=0x4000,
dma-window.sub-window.1.pci-msi-subwindow,
dma-window.sub-window.2.guest-addr=0x40000000,
dma-window.sub-window.2.size=0x01000000,
dma-window.sub-window.3.guest-addr=0xc0000000,
dma-window.sub-window.3.size=0x04000000}

The above are from some real examples based on the SoC device 
assignment mechanisms in the Freescale Embedded Hypervisor.
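
One thing QEMU could do with a window/sub-window description like the
PCI example is a basic sanity check before programming anything.  The
Python sketch below checks that every sub-window lies inside the DMA
window and that no two sub-windows overlap.  These two checks and the
function name check_subwindows are assumptions chosen for illustration;
they are not a statement of the actual Freescale IOMMU constraints.

```python
def check_subwindows(window_addr, window_size, subwindows):
    """Validate a list of (guest_addr, size) sub-windows.

    Raises ValueError if a sub-window has a non-positive size, falls
    outside the DMA window, or overlaps another sub-window.  Returns
    the sub-windows sorted by guest address.
    """
    window_end = window_addr + window_size
    ordered = sorted(subwindows)
    prev_end = window_addr
    for addr, size in ordered:
        if size <= 0:
            raise ValueError("sub-window at %#x has no size" % addr)
        if addr < window_addr or addr + size > window_end:
            raise ValueError("sub-window at %#x is outside the window" % addr)
        if addr < prev_end:
            raise ValueError("sub-window at %#x overlaps the previous one" % addr)
        prev_end = addr + size
    return ordered

# Values taken from the pci-bus example: a 4GB window with four sub-windows.
pci_subwindows = [
    (0x00000000, 0x20000000),
    (0x20000000, 0x4000),
    (0x40000000, 0x01000000),
    (0xc0000000, 0x04000000),
]
checked = check_subwindows(0x0, 0x100000000, pci_subwindows)
```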

A final thing...

Both options 1 and 2 above introduce an implementation complexity:
both need to be able to parse the text device tree syntax format.  Option
2 needs it because the entire node is passed as text, and both options need
it for doing complex node updates.  QEMU would need to do syntactic and
semantic parsing of DTS syntax, basically needing parts of the front end of
dtc (the device tree compiler, http://git.jdl.com/gitweb/).

Option 3.  So a 3rd approach could be an extension of options 1
or 2.  Instead of expressing nodes in ASCII DTS format, which requires
parsing, pass a compiled file in device tree binary format to QEMU
that expresses the Qdev properties.

So instead of:
 -device assigned-soc-dev,dev=/soc/address@hidden,delete-prop=phy-handle,
  node-update="fixed-link = <2 1 1000 0 0>"

You might have a config file containing:

ethernet0 {
   compatible = "device";
   type = "assigned-soc-dev";
   dev = "/soc/address@hidden";
   node-update {
      delete-prop="phy-handle";
      fixed-link = <2 1 1000 0 0>;
   }; 
};

You would compile the file into a DTB and then pass it to QEMU:

   -config-dtb ./myguest.dtb

The above is a very simple example-- the benefit of this approach is
in the much more complicated node updates that are sometimes needed.

The config-dtb is just an alternate way of getting complex
device tree data into QEMU.  It supplements and does not change
existing QEMU architecture.
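
For readers unfamiliar with the binary format, the Python sketch below
shows the kind of header check a consumer of a config DTB would start
with.  In QEMU this job would fall to libfdt; the Python code is only an
illustration.  It assumes the version 17 header layout (ten big-endian
32-bit fields, magic 0xd00dfeed), and the function name read_fdt_header
and the header-only dummy blob are made up for the example.

```python
import struct

FDT_MAGIC = 0xd00dfeed
FDT_HEADER = struct.Struct(">10I")  # ten big-endian 32-bit fields (v17 layout)
FDT_FIELDS = (
    "magic", "totalsize", "off_dt_struct", "off_dt_strings",
    "off_mem_rsvmap", "version", "last_comp_version",
    "boot_cpuid_phys", "size_dt_strings", "size_dt_struct",
)

def read_fdt_header(blob):
    """Decode and sanity-check the header at the start of a DTB blob."""
    if len(blob) < FDT_HEADER.size:
        raise ValueError("too short to be a DTB")
    header = dict(zip(FDT_FIELDS, FDT_HEADER.unpack_from(blob)))
    if header["magic"] != FDT_MAGIC:
        raise ValueError("bad magic: %#x" % header["magic"])
    if header["totalsize"] > len(blob):
        raise ValueError("blob is shorter than totalsize")
    return header

# Header-only dummy blob, just enough to exercise the check above.
# It is not a complete device tree.
dummy = FDT_HEADER.pack(FDT_MAGIC, 40, 40, 40, 40, 17, 16, 0, 0, 0)
header = read_fdt_header(dummy)
```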

Some pluses of this approach:
   -avoids pulling in substantial complexity for parsing DTS
    syntax
   -device tree nodes are represented in their "native" DTB
    format
   -an available user space library (libfdt) is already part
    of QEMU for parsing DTBs
   -greatly simplifies handling node updates where nodes reference other
    nodes
   -could use either option 1 (assign node by reference) or option 2
    (assign node by
   -we've used an approach similar to this in the Freescale Embedded
    Hypervisor for 3+ years now and it's held up well


Regards,
Stuart Yoder



