Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

From: Tian, Kevin
Subject: Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
Date: Thu, 12 May 2016 08:00:36 +0000
> From: Alex Williamson [mailto:address@hidden]
> Sent: Thursday, May 12, 2016 6:06 AM
>
> On Wed, 11 May 2016 17:15:15 +0800
> Jike Song <address@hidden> wrote:
>
> > On 05/11/2016 12:02 AM, Neo Jia wrote:
> > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> > >>>> From: Song, Jike
> > >>>>
> > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > >>>> hardware. It just, as you said in another mail, "rather than
> > >>>> programming them into an IOMMU for a device, it simply stores the
> > >>>> translations for use by later requests".
> > >>>>
> > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be
> > >>>> disabled.
> > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > >>>> translations without any knowledge of the hardware IOMMU; how is the
> > >>>> device model supposed to get an IOVA for a given GPA (and thereby an
> > >>>> HPA from the IOMMU backend here)?
> > >>>>
> > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > >>>> pins & translates vaddr to PFN; then it will be very difficult for the
> > >>>> device model to figure out:
> > >>>>
> > >>>> 1, for a given GPA, how to avoid calling dma_map_page multiple
> > >>>> times?
> > >>>> 2, for which page to call dma_unmap_page?
> > >>>>
> > >>>> --
> > >>>
> > >>> We have to support both the w/ iommu and w/o iommu cases, since
> > >>> that fact is out of the GPU driver's control. A simple way is to use
> > >>> dma_map_page, which internally copes with both the w/ and w/o iommu
> > >>> cases gracefully, i.e. it returns an HPA w/o iommu and an IOVA w/ iommu.
> > >>> Then in this file we only need to cache GPA to whatever dma_addr_t
> > >>> is returned by dma_map_page.
> > >>>
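A minimal sketch of that idea, assuming the backing pfn has already been
pinned (function and struct names here are illustrative only, not from the
actual patch set):

    #include <linux/dma-mapping.h>
    #include <linux/mm.h>

    /*
     * Map one pinned guest page for DMA. dma_map_page() transparently
     * returns the HPA when no IOMMU is present and an IOVA when one is,
     * so the caller only ever caches gpa -> dma_addr_t.
     */
    static int vgpu_map_guest_page(struct device *dev, unsigned long pfn,
                                   dma_addr_t *dma_addr)
    {
            *dma_addr = dma_map_page(dev, pfn_to_page(pfn), 0, PAGE_SIZE,
                                     DMA_BIDIRECTIONAL);
            if (dma_mapping_error(dev, *dma_addr))
                    return -EFAULT;
            return 0;   /* cache gpa -> *dma_addr in the vgpu backend */
    }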
> > >>
> > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> > >
> > > Hi Jike,
> > >
> > > With mediated passthru, you can still use a hardware iommu, but more
> > > importantly that part is actually orthogonal to what we are discussing
> > > here, as we will only cache the mapping between <gfn (iova if guest has
> > > iommu), (qemu) va>. Once we have pinned pages later with the help of the
> > > above info, you can map them into the proper iommu domain if the system
> > > is configured to do so.
> > >
> >
> > Hi Neo,
> >
> > Technically yes, you can map a pfn into the proper IOMMU domain elsewhere,
> > but to find out whether a pfn was previously mapped or not, you have to
> > track it with another rbtree-like data structure (the IOMMU driver simply
> > doesn't bother with tracking), which seems to somewhat duplicate the vGPU
> > IOMMU backend we are discussing here.
> >
> > And is it also semantically correct for an IOMMU backend to handle both w/
> > and w/o IOMMU hardware? :)
>
> A problem with the iommu doing the dma_map_page(), though, is for what
> device does it do this? In the mediated case the vfio infrastructure
> is dealing with a software representation of a device. For all we
> know that software model could transparently migrate from one physical
> GPU to another. There may not even be a physical device backing
> the mediated device. Those are details left to the vgpu driver itself.
This is a fair argument. The VFIO iommu driver simply serves user space
requests, where only vaddr<->iova (essentially gpa in the kvm case)
matters. How an iova is mapped into the real IOMMU is not VFIO's concern.
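For reference, each entry the type1 driver tracks is conceptually just such
a coarse-grained vaddr<->iova region, roughly along these lines (a
simplified sketch, not the exact kernel definition):

    /* one userspace-requested mapping: iova (gpa for kvm) -> vaddr */
    struct vfio_dma {
            struct rb_node  node;   /* indexed by iova in an rb-tree */
            dma_addr_t      iova;   /* device/guest address */
            unsigned long   vaddr;  /* process virtual address */
            size_t          size;   /* size of the region in bytes */
            int             prot;   /* IOMMU_READ / IOMMU_WRITE */
    };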
>
> Perhaps one possibility would be to allow the vgpu driver to register
> map and unmap callbacks. The unmap callback might provide the
> invalidation interface that we're so far missing. The combination of
> map and unmap callbacks might simplify the Intel approach of pinning the
> entire VM memory space, i.e. for each map callback do a translation
> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> the translation. There's still the problem of where that dma_addr_t
> from the dma_map_page is stored though. Someone would need to keep
> track of iova to dma_addr_t. The vfio iommu might be a place to do
> that since we're already tracking information based on iova, possibly
> in an opaque data element provided by the vgpu driver. However, we're
> going to need to take a serious look at whether an rb-tree is the right
> data structure for the job. It works well for the current type1
> functionality where we typically have tens of entries. I think the
> NVIDIA model of sparse pinning the VM is pushing that up to tens of
> thousands. If Intel intends to pin the entire guest, that's
> potentially tens of millions of tracked entries and I don't know that
> an rb-tree is the right tool for that job. Thanks,
>
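A rough sketch of what such registerable callbacks could look like
(interface names are hypothetical, just to show the shape of the idea):

    /* hypothetical ops a vgpu driver could register with the vfio iommu */
    struct vfio_vgpu_dma_ops {
            /*
             * Called when userspace maps an iova range: pin the backing
             * pages and dma_map them; return an opaque cookie the vfio
             * iommu stores alongside its own tracking for this range.
             */
            int  (*map)(void *vgpu, dma_addr_t iova, unsigned long vaddr,
                        size_t size, int prot, void **cookie);
            /*
             * Called on unmap (the missing invalidation path): dma_unmap
             * and unpin whatever the map callback set up.
             */
            void (*unmap)(void *vgpu, dma_addr_t iova, size_t size,
                          void *cookie);
    };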
Based on the above thoughts, I'm wondering whether the below would work
(let's use gpa to replace the existing iova in the type1 driver, while using
iova for the one actually used in the vGPU driver; assume the 'pin-all'
scenario first, which matches the existing vfio logic):

- No change to the existing vfio_dma structure. VFIO still maintains the
  gpa<->vaddr mapping, in coarse-grained regions;

- Leverage the same page accounting/pinning logic in the type1 driver, which
  should be enough for the 'pin-all' usage;

- Then the main divergence point for vGPU would be in vfio_unmap_unpin
  and vfio_iommu_map. I'm not sure whether it's easy to fake an
  iommu_domain for vGPU so the same iommu_map/unmap can be reused.
  If not, we may introduce two new map/unmap callbacks provided
  specifically by the vGPU core driver, as you suggested:

    * The vGPU core driver uses dma_map_page to map the specified pfns:
        o When the IOMMU is enabled, we'll get an iova returned that is
          different from the pfn;
        o When the IOMMU is disabled, the returned iova is the same as the pfn;
    * Then the vGPU core driver just maintains its own gpa<->iova lookup
      table (e.g. called vgpu_dma; a rough sketch follows below);
    * Because each vfio_iommu_map invocation is for a contiguous region,
      we can expect the same number of vgpu_dma entries as are maintained
      in the vfio_dma list.

Then it's the vGPU core driver's responsibility to provide the gpa<->iova
lookup for the vendor-specific GPU driver, and we don't need to worry about
tens of thousands of entries. Once we get this simple 'pin-all' model ready,
it can be further extended to support the 'pin-sparse' scenario: we still
maintain a top-level vgpu_dma list, with each entry further linking to its
own sparse mapping structure. In reality I don't expect we really need to
maintain per-page translations even with sparse pinning.
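A sketch of what such a vgpu_dma entry and its gpa->iova lookup could look
like (again purely illustrative, following the vgpu_dma name used above):

    #include <linux/rbtree.h>
    #include <linux/types.h>

    /* one contiguous mapped region tracked by the vGPU core driver */
    struct vgpu_dma {
            struct rb_node  node;   /* indexed by gpa */
            u64             gpa;    /* guest physical start address */
            dma_addr_t      iova;   /* from dma_map_page(): HPA w/o IOMMU,
                                     * IOVA w/ IOMMU */
            size_t          size;
    };

    /* gpa -> vgpu_dma lookup used by the vendor-specific GPU driver */
    static struct vgpu_dma *vgpu_dma_find(struct rb_root *root, u64 gpa)
    {
            struct rb_node *n = root->rb_node;

            while (n) {
                    struct vgpu_dma *d = rb_entry(n, struct vgpu_dma, node);

                    if (gpa < d->gpa)
                            n = n->rb_left;
                    else if (gpa >= d->gpa + d->size)
                            n = n->rb_right;
                    else
                            return d;       /* gpa falls inside this region */
            }
            return NULL;
    }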
Thanks
Kevin