[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v

From: Alexey Kardashevskiy
Subject: Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
Date: Wed, 23 Mar 2016 14:06:36 +1100
User-agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.7.0

On 03/23/2016 01:53 PM, David Gibson wrote:
On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
On 03/23/2016 12:08 PM, David Gibson wrote:
On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
On 03/22/2016 04:14 PM, David Gibson wrote:
On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
This adds ability to VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when new VFIO container is added/removed.

This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
and adds just created IOMMU into the host IOMMU list; the opposite
action is taken in vfio_listener_region_del.

When creating a new window, this uses euristic to decide on the TCE table
levels number.

This should cause no guest visible change in behavior.

Signed-off-by: Alexey Kardashevskiy <address@hidden>
* new to the series

* export levels to PHB
  hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
  trace-events     |   2 ++
  2 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4e873b7..421d6eb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
      return 0;

+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
+    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 

The hard-coded 0x1000 looks dubious..

Well, that's the minimal page size...

Really?  Some BookE CPUs support 1KiB page size..

Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)

Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
it's been done for CPU MMU I wouldn't count on it not being done for

1 is a safer choice.

+    g_assert(hiommu);
+    QLIST_REMOVE(hiommu, hiommu_next);
  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
      return (!memory_region_is_ram(section->mr) &&
@@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener 
      end = int128_get64(llend);

+    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {

I think this would be clearer split out into a helper function,
vfio_create_host_window() or something.

It is rather vfio_spapr_create_host_window() and we were avoiding
xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
separate file but this usually triggers more discussion and never ends well.

+        unsigned entries, pages;
+        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) 
+        g_assert(section->mr->iommu_ops);
+        g_assert(memory_region_is_iommu(section->mr));

I don't think you need these asserts.  AFAICT the same logic should
work if a RAM MR was added directly to PCI address space - this would
create the new host window, then the existing code for adding a RAM MR
would map that block of RAM statically into the new window.

In what configuration/machine can we do that on SPAPR?

spapr guests won't ever do that.  But you can run an x86 guest on a
powernv host and this situation could come up.

I am pretty sure VFIO won't work in this case anyway.

I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.

This is not about TCG (pseries TCG guest works with VFIO on powernv host), this is about things like VFIO_IOMMU_GET_INFO vs. VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.

Should I add such support in this patchset?

In any case there's no point asserting if the code is correct anyway.

Assert here says (at least) "not tested" or "not expected to


+        trace_vfio_listener_region_add_iommu(iova, end - 1);
+        /*
+         * FIXME: For VFIO iommu types which have KVM acceleration to
+         * avoid bouncing all map/unmaps through qemu this way, this
+         * would be the right place to wire that up (tell the KVM
+         * device emulation the VFIO iommu handles to use).
+         */
+        create.window_size = memory_region_size(section->mr);
+        create.page_shift =
+                ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));

Ah.. except that I guess you'd need to fall back to host page size
here to handle a RAM MR.

Can you give an example of such RAM MR being added to PCI AS on

On spapr, no.  But you can run other machine types as guests (at least
with TCG) on a host with the spapr IOMMU.

+        /*
+         * SPAPR host supports multilevel TCE tables, there is some
+         * euristic to decide how many levels we want for our table:
+         * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+         */
+        entries = create.window_size >> create.page_shift;
+        pages = (entries * sizeof(uint64_t)) / getpagesize();
+        create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+        if (ret) {
+            error_report("Failed to create a window, ret = %d (%m)", ret);
+            goto fail;
+        }
+        if (create.start_addr != section->offset_within_address_space ||
+            vfio_host_iommu_lookup(container, create.start_addr,
+                                   create.start_addr + create.window_size - 
1)) {

Under what circumstances can this trigger?  Is the kernel ioctl
allowed to return a different window start address than the one

You already asked this some time ago :) The userspace cannot request
address, the host kernel returns one.

Ok.  For generality it would be nice if you could succeed here as long
as the new host window covers the requested guest window, even if it
doesn't match exactly.  And for that matter to not request the new
window if the host already has a window covering the guest region.

That would be dead code - when would it possibly work? I mean I could
instrument an artificial test but the actual user which might appear later
will likely be soooo different so this won't help anyway.

Hmm, I suppose.  It actually shouldn't be that hard to trigger a case
like this, if you just bumped the bridge's dma64 base address property
up a little bit - above the host kernel's base address, but small
enough that you can still easily fit the guest memory in.

I can test it today for sure but once committed, we will have to support it. Which I am trying to avoid until we get clear picture what we are supporting here.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]