From: Dong Jia
Subject: Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
Date: Fri, 13 May 2016 16:39:37 +0800
On Fri, 13 May 2016 00:24:34 -0700
Neo Jia <address@hidden> wrote:
> On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> > On Thu, 12 May 2016 13:05:52 -0600
> > Alex Williamson <address@hidden> wrote:
> >
> > > On Thu, 12 May 2016 08:00:36 +0000
> > > "Tian, Kevin" <address@hidden> wrote:
> > >
> > > > > From: Alex Williamson [mailto:address@hidden
> > > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > >
> > > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > > Jike Song <address@hidden> wrote:
> > > > >
> > > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:
> > > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> > > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> > > > > > >>>> From: Song, Jike
> > > > > > >>>>
> > > > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying
> > > > > > >>>> IOMMU
> > > > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > > > >>>> programming them into an IOMMU for a device, it simply stores
> > > > > > >>>> the
> > > > > > >>>> translations for use by later requests".
> > > > > > >>>>
> > > > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must
> > > > > > >>>> be disabled.
> > > > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually
> > > > > > >>>> programs
> > > > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or
> > > > > > >>>> dma_map_page;
> > > > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <->
> > > > > > >>>> HPA
> > > > > > >>>> translations without any knowledge about the hardware IOMMU,
> > > > > > >>>> what is the device model supposed to do to get an IOVA for a
> > > > > > >>>> given GPA (and thereby an HPA from the IOMMU backend here)?
> > > > > > >>>>
> > > > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates,
> > > > > > >>>> it pins & translates vaddr to PFN, then it will be very
> > > > > > >>>> difficult for the
> > > > > > >>>> device model to figure out:
> > > > > > >>>>
> > > > > > >>>> 1, for a given GPA, how to avoid calling dma_map_page
> > > > > > >>>> multiple times?
> > > > > > >>>> 2, for which page to call dma_unmap_page?
> > > > > > >>>>
> > > > > > >>>> --
> > > > > > >>>
> > > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > > >>> Then in this file we only need to cache GPA to whatever
> > > > > > >>> dma_addr_t is returned by dma_map_page.
> > > > > > >>>
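Kevin's suggestion above, calling dma_map_page() once and caching GPA to whatever dma_addr_t comes back, can be modeled in user space. The sketch below is illustrative only (fake_dma_map_page, map_gpa, and friends are invented stand-ins, not kernel or VFIO API); it shows how such a cache answers both of Jike's questions: a cached entry prevents repeated dma_map_page calls, and it records which handle to later pass to dma_unmap_page.

```c
/* User-space model (not kernel code) of the GPA -> dma_addr_t cache.
 * All names here are invented for illustration. */
#include <assert.h>
#include <stdint.h>

#define NPAGES 16
#define INVALID_DMA ((uint64_t)-1)

uint64_t gpa_to_dma[NPAGES];   /* one slot per guest page */
int map_calls;                 /* how often the "IOMMU" was programmed */

/* Stand-in for dma_map_page(): returns an IOVA (or HPA w/o an IOMMU). */
static uint64_t fake_dma_map_page(uint64_t gpa)
{
	map_calls++;
	return 0x80000000ull + gpa;    /* arbitrary fake IOVA layout */
}

void cache_init(void)
{
	for (int i = 0; i < NPAGES; i++)
		gpa_to_dma[i] = INVALID_DMA;
	map_calls = 0;
}

/* Map a GPA, consulting the cache first, so dma_map_page is never
 * called twice for the same page (question 1 above). */
uint64_t map_gpa(uint64_t gpa)
{
	uint64_t idx = gpa >> 12;
	if (gpa_to_dma[idx] == INVALID_DMA)
		gpa_to_dma[idx] = fake_dma_map_page(gpa);
	return gpa_to_dma[idx];
}

/* The cache also records which handle to unmap (question 2 above). */
uint64_t unmap_gpa(uint64_t gpa)
{
	uint64_t idx = gpa >> 12, dma = gpa_to_dma[idx];
	gpa_to_dma[idx] = INVALID_DMA;
	return dma;    /* caller would pass this to dma_unmap_page() */
}
```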
> > > > > > >>
> > > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility
> > > > > > >> here?
> > > > > > >
> > > > > > > Hi Jike,
> > > > > > >
> > > > > > > With mediated passthru, you can still use a hardware iommu, but
> > > > > > > more importantly, that part is actually orthogonal to what we are
> > > > > > > discussing here, as we will only cache the mapping between
> > > > > > > <gfn (iova if the guest has an iommu), (qemu) va>. Once we have
> > > > > > > pinned pages later with the help of the above info, you can map
> > > > > > > them into the proper iommu domain if the system is configured so.
> > > > > > >
> > > > > >
> > > > > > Hi Neo,
> > > > > >
> > > > > > Technically, yes, you can map a pfn into the proper IOMMU domain
> > > > > > elsewhere, but to find out whether a pfn was previously mapped or
> > > > > > not, you have to track it with another rbtree-like data structure
> > > > > > (the IOMMU driver simply doesn't bother with tracking), which seems
> > > > > > to somewhat duplicate the vGPU IOMMU backend we are discussing here.
> > > > > >
> > > > > > And isn't it also semantically correct for an IOMMU backend to
> > > > > > handle both the w/ and w/o IOMMU hardware cases? :)
> > > > >
> > > > > A problem with the iommu doing the dma_map_page(), though, is: for
> > > > > what device does it do this? In the mediated case the vfio infrastructure
> > > > > is dealing with a software representation of a device. For all we
> > > > > know that software model could transparently migrate from one physical
> > > > > GPU to another. There may not even be a physical device backing
> > > > > the mediated device. Those are details left to the vgpu driver
> > > > > itself.
> > > >
> > > > This is a fair argument. The VFIO iommu driver simply serves user space
> > > > requests, where only vaddr<->iova (essentially gpa in the kvm case)
> > > > matters. How the iova is mapped into the real IOMMU is not VFIO's concern.
> > > >
> > > > >
> > > > > Perhaps one possibility would be to allow the vgpu driver to register
> > > > > map and unmap callbacks. The unmap callback might provide the
> > > > > invalidation interface that we're so far missing. The combination of
> > > > > map and unmap callbacks might simplify the Intel approach of pinning
> > > > > the
> > > > > entire VM memory space, i.e. for each map callback do a translation
> > > > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > > > the translation. There's still the problem of where that dma_addr_t
> > > > > from the dma_map_page is stored though. Someone would need to keep
> > > > > track of iova to dma_addr_t. The vfio iommu might be a place to do
> > > > > that since we're already tracking information based on iova, possibly
> > > > > in an opaque data element provided by the vgpu driver. However, we're
> > > > > going to need to take a serious look at whether an rb-tree is the
> > > > > right
> > > > > data structure for the job. It works well for the current type1
> > > > > functionality where we typically have tens of entries. I think the
> > > > > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > > > > thousands. If Intel intends to pin the entire guest, that's
> > > > > potentially tens of millions of tracked entries and I don't know that
> > > > > an rb-tree is the right tool for that job. Thanks,
> > > > >
> > > >
> > > > Based on above thought I'm thinking whether below would work:
> > > > (let's use gpa to replace existing iova in type1 driver, while using
> > > > iova
> > > > for the one actually used in vGPU driver. Assume 'pin-all' scenario
> > > > first
> > > > which matches existing vfio logic)
> > > >
> > > > - No change to existing vfio_dma structure. VFIO still maintains
> > > > gpa<->vaddr
> > > > mapping, in coarse-grained regions;
> > > >
> > > > - Leverage same page accounting/pinning logic in type1 driver, which
> > > > should be enough for 'pin-all' usage;
> > > >
> > > > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > > > and vfio_iommu_map. I'm not sure whether it's easy to fake an
> > > > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> > >
> > > This seems troublesome. Kirti's version used numerous api-only tests
> > > to avoid these which made the code difficult to trace. Clearly one
> > > option is to split out the common code so that a new mediated-type1
> > > backend skips this, but they thought they could clean it up without
> > > this, so we'll see what happens in the next version.
> > >
> > > > If not, we may introduce two new map/unmap callbacks provided
> > > > specifically by vGPU core driver, as you suggested:
> > > >
> > > > * vGPU core driver uses dma_map_page to map specified pfns:
> > > >
> > > >     o When IOMMU is enabled, we'll get an iova returned that is
> > > >       different from the pfn;
> > > >     o When IOMMU is disabled, the returned iova is the same as the pfn;
> > >
> > > Either way each iova needs to be stored and we have a worst case of one
> > > iova per page of guest memory.
> > >
> > > > * Then vGPU core driver just maintains its own gpa<->iova lookup
> > > > table (e.g. called vgpu_dma)
> > > >
> > > > * Because each vfio_iommu_map invocation is about a contiguous
> > > > region, we can expect same number of vgpu_dma entries as maintained
> > > > for vfio_dma list;
> > > >
> > > > Then it's the vGPU core driver's responsibility to provide gpa<->iova
> > > > lookup for the vendor-specific GPU driver. And we don't need to worry
> > > > about tens of thousands of entries. Once we get this simple 'pin-all'
> > > > model ready, it can be further extended to support the 'pin-sparse'
> > > > scenario: we still maintain a top-level vgpu_dma list, with each entry
> > > > further linking to its own sparse mapping structure. In reality I don't
> > > > expect we really need to maintain per-page translations even with
> > > > sparse pinning.
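A rough user-space sketch of the vgpu_dma idea above (all names invented for illustration; this is not the actual VFIO code): one entry per contiguous mapped region, the same granularity as vfio_dma, holding a per-page iova array that the map callback would fill via dma_map_page.

```c
/* Toy model of a vgpu_dma region entry with a gpa -> iova lookup.
 * Names and the fake IOVA layout are invented for illustration. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12

struct vgpu_dma {
	uint64_t gpa;          /* guest-physical base of the region */
	uint64_t size;         /* region size in bytes */
	uint64_t *iova;        /* one iova per page, filled at map time */
};

/* Create a region and "map" every page (pin-all model). */
struct vgpu_dma *vgpu_dma_map(uint64_t gpa, uint64_t size)
{
	uint64_t npages = size >> PAGE_SHIFT;
	struct vgpu_dma *d = malloc(sizeof(*d));
	d->gpa = gpa;
	d->size = size;
	d->iova = malloc(npages * sizeof(uint64_t));
	for (uint64_t i = 0; i < npages; i++)
		d->iova[i] = 0xf0000000ull + (i << PAGE_SHIFT); /* fake IOVA */
	return d;
}

/* The gpa -> iova lookup the vendor driver would use. */
uint64_t vgpu_dma_translate(struct vgpu_dma *d, uint64_t gpa)
{
	assert(gpa >= d->gpa && gpa < d->gpa + d->size);
	return d->iova[(gpa - d->gpa) >> PAGE_SHIFT];
}
```

Note the worst case Alex describes: the per-page iova array is size/PAGE_SIZE entries, even though the vgpu_dma list itself stays as short as the vfio_dma list.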
> > >
> > > If you're trying to equate the scale of what we need to track vs what
> > > type1 currently tracks, they're significantly different. Possible
> > > things we need to track include the pfn, the iova, and possibly a
> > > reference count or some sort of pinned page map. In the pin-all model
> > > we can assume that every page is pinned on map and unpinned on unmap,
> > > so a reference count or map is unnecessary. We can also assume that we
> > > can always regenerate the pfn with get_user_pages() from the vaddr, so
> > > we don't need to track that. I don't see any way around tracking the
> > > iova. The iommu can't tell us this like it can with the normal type1
> > > model because the pfn is the result of the translation, not the key for
> > > the translation. So we're always going to have between 1 and
> > > (size/PAGE_SIZE) iova entries per vgpu_dma entry. You might be able to
> > > manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> > > data structure tracking every iova.
> > >
> > > Sparse mapping has the same issue but of course the tree of iovas is
> > > potentially incomplete and we need a way to determine where it's
> > > incomplete. A page table rooted in the vgpu_dma and indexed by the
> > > offset from the start vaddr seems like the way to go here. It's also
> > > possible that some mediated device models might store the iova in the
> > > command sent to the device and therefore be able to parse those entries
> > > back out to unmap them without storing them separately. This might be
> > > how the s390 channel-io model would prefer to work.
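The page table rooted in the vgpu_dma and indexed by the offset from the start vaddr could be modeled like this (a toy, single-level sketch with invented names): a sentinel marks pages that were never pinned, so "is this page tracked?" becomes a direct index rather than an rb-tree search, and incompleteness is explicit.

```c
/* Toy model of sparse per-page tracking rooted in a vgpu_dma.
 * All names are invented for illustration. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12
#define IOVA_NONE ((uint64_t)-1)       /* sentinel: page not pinned */

struct sparse_dma {
	uint64_t vaddr;                /* start vaddr of the region */
	uint64_t npages;
	uint64_t *table;               /* iova per page, IOVA_NONE if absent */
};

struct sparse_dma *sparse_dma_new(uint64_t vaddr, uint64_t npages)
{
	struct sparse_dma *d = malloc(sizeof(*d));
	d->vaddr = vaddr;
	d->npages = npages;
	d->table = malloc(npages * sizeof(uint64_t));
	for (uint64_t i = 0; i < npages; i++)
		d->table[i] = IOVA_NONE;
	return d;
}

/* Record an iova for one page when it is pinned. */
void sparse_dma_pin(struct sparse_dma *d, uint64_t vaddr, uint64_t iova)
{
	d->table[(vaddr - d->vaddr) >> PAGE_SHIFT] = iova;
}

/* Returns the iova, or IOVA_NONE where the tracking is incomplete. */
uint64_t sparse_dma_lookup(struct sparse_dma *d, uint64_t vaddr)
{
	return d->table[(vaddr - d->vaddr) >> PAGE_SHIFT];
}
```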
> > Dear Alex:
> >
> > For the s390 channel-io model, when an I/O instruction is intercepted
> > and issued to the device driver for further translation, the operand of
> > the instruction contains only iovas. Since the iova is the key to locate
> > an entry in the database (r-b tree or whatever), we can indeed parse the
> > entries back out one by one when doing the unmap operation.
> >
> > BTW, if the mediated-iommu backend could potentially offer
> > transaction-level support for the unmap operation, I believe it would
> > benefit the performance of this case.
> >
> > e.g.:
> > handler = vfio_transaction_begin();
> > foreach (iova in the command) {
> >     pfn = vfio_transaction_map(handler, iova);
> >     do_io(pfn);
> > }
>
> Hi Dong,
>
> Could you please help me understand the performance benefit here?
>
> Is the perf argument coming from the searching of rbtree of the tracking data
> structure or something else?
>
> For example you can do similar thing by the following sequence from your
> backend
> driver:
>
> vfio_pin_pages(gfn_list/iova_list /* in */, npages, prot,
>                pfn_bases /* out */)
> foreach (pfn)
>     do_io(pfn)
> vfio_unpin_pages(pfn_bases)
Dear Neo:
FWIU, the channel-io driver could leverage these interfaces without any
obvious feasibility issues. Since the current implementation of
vfio_unpin_pages iterates over @pfn_bases and finds the corresponding
entry in the rb tree for each pfn_base, I'm wondering if a dedicated
list of the entries for the whole @pfn_bases could offer some benefit.
That said, I know it's too early to worry about performance, and the
current interfaces are fine for the channel-io case.
I'm also wondering if such an idea could contribute a little to your
discussion of how to manage the key-value mapping. If this is just
noise (sorry for that :<), please ignore it.
My main intention is to show up, and to elaborate a bit on the
channel-io use case. So you can see that there really is another user
of the mediated-iommu backend, and, as Alex mentioned before, getting
rid of 'vgpu_dev' and other vgpu-specific stuff is indeed necessary. :>
>
> Thanks,
> Neo
>
> >
> > /*
> >  * Expect to unmap all of the pfns mapped in this transaction with the
> >  * next statement. The mediated-iommu backend could use the handler as
> >  * the key to track the list of the entries.
> >  */
> > vfio_transaction_unmap(handler);
> > vfio_transaction_end(handler);
> >
> > Not sure if this could benefit the vgpu sparse mapping use case though.
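The transaction idea above could be sketched as follows (a user-space model with hypothetical names; vfio_transaction_* is not an existing VFIO API): the handle records every mapping made under it, so a single unmap call releases the whole batch without the caller re-supplying each iova.

```c
/* Toy model of transaction-keyed mapping: the handle owns a list of
 * entries, and unmapping tears down the whole list at once.
 * All names are invented for illustration. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct txn_entry {
	uint64_t iova;
	uint64_t pfn;
	struct txn_entry *next;
};

struct txn {
	struct txn_entry *entries;
	int nr;
};

struct txn *txn_begin(void)
{
	return calloc(1, sizeof(struct txn));
}

/* Map one iova within the transaction; the handle records the entry. */
uint64_t txn_map(struct txn *t, uint64_t iova)
{
	struct txn_entry *e = malloc(sizeof(*e));
	e->iova = iova;
	e->pfn = iova >> 12;           /* fake iova -> pfn translation */
	e->next = t->entries;
	t->entries = e;
	t->nr++;
	return e->pfn;
}

/* Unmap everything recorded under the handle in one call; returns the
 * number of entries released. */
int txn_unmap_all(struct txn *t)
{
	int n = 0;
	struct txn_entry *e = t->entries;
	while (e) {
		struct txn_entry *next = e->next;
		free(e);               /* stand-in for a per-entry unpin */
		e = next;
		n++;
	}
	t->entries = NULL;
	t->nr = 0;
	return n;
}
```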
>
> >
> > > That seems like
> > > further validation that such tracking is going to be dependent on the
> > > mediated driver itself and probably not something to centralize in a
> > > mediated iommu driver. Thanks,
> > >
> > > Alex
> > >
> >
> >
> >
> > --------
> > Dong Jia
> >
>
--------
Dong Jia