qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2


From: Alex Williamson
Subject: Re: [Qemu-devel] [RFC PATCH qemu v3 4/4] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
Date: Sun, 19 Jul 2015 09:04:24 -0600

On Sun, 2015-07-19 at 01:05 +1000, David Gibson wrote:
> On Fri, Jul 17, 2015 at 12:25:31PM -0600, Alex Williamson wrote:
> > On Fri, 2015-07-17 at 15:20 +1000, David Gibson wrote:
> > > On Thu, Jul 16, 2015 at 08:44:59AM -0600, Alex Williamson wrote:
> > > > On Thu, 2015-07-16 at 15:11 +1000, David Gibson wrote:
> > > > > On Tue, Jul 14, 2015 at 10:21:54PM +1000, Alexey Kardashevskiy wrote:
> > > > > > This makes use of the new "memory registering" feature. The idea is
> > > > > > to provide the userspace ability to notify the host kernel about 
> > > > > > pages
> > > > > > which are going to be used for DMA. Having this information, the 
> > > > > > host
> > > > > > kernel can pin them all once per user process, do locked pages
> > > > > > accounting (once) and not spent time on doing that in real time with
> > > > > > possible failures which cannot be handled nicely in some cases.
> > > > > > 
> > > > > > This adds a guest RAM memory listener which notifies a VFIO 
> > > > > > container
> > > > > > about memory which needs to be pinned/unpinned. VFIO MMIO regions
> > > > > > (i.e. "skip dump" regions) are skipped.
> > > > > > 
> > > > > > The feature is only enabled for SPAPR IOMMU v2. The host kernel 
> > > > > > changes
> > > > > > are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, 
> > > > > > this does
> > > > > > not call it when v2 is detected and enabled.
> > > > > > 
> > > > > > This does not change the guest visible interface.
> > > > > > 
> > > > > > Signed-off-by: Alexey Kardashevskiy <address@hidden>
> > > > > 
> > > > > I've looked at this in more depth now, and attempting to unify the
> > > > > pre-reg and mapping listeners like this can't work - they need to be
> > > > > listening on different address spaces:  mapping actions need to be
> > > > > listening on the PCI address space, whereas the pre-reg needs to be
> > > > > listening on address_space_memory.  For x86 - for now - those end up
> > > > > being the same thing, but on Power they're not.
> > > > > 
> > > > > We do need to be clear about what differences are due to the presence
> > > > > of a guest IOMMU versus which are due to arch or underlying IOMMU
> > > > > type.  For now Power has a guest IOMMU and x86 doesn't, but that could
> > > > > well change in future: we could well implement the guest side IOMMU
> > > > > for x86 in future (or x86 could invent a paravirt IOMMU interface).
> > > > > On the other side, BenH's experimental powernv machine type could
> > > > > introduce Power machines without a guest side IOMMU (or at least an
> > > > > optional guest side IOMMU).
> > > > > 
> > > > > The quick and dirty approach here is:
> > > > >    1. Leave the main listener as is
> > > > >    2. Add a new pre-reg notifier to the spapr iommu specific code,
> > > > >       which listens on address_space_memory, *not* the PCI space
> > > > 
> > > > It is dirty and that's exactly what I've been advising Alexey against
> > > > because we have entirely too much dirty spapr specific code that doesn't
> > > > need to be spapr specific.  I don't see why separate address space
> > > > matters, that's done at the point of registering the listener and so far
> > > > doesn't play much a role in the actual listener behavior, just which
> > > > regions it sees.
> > > 
> > > Well, there's two parts to this - the different address spaces means
> > > they need to be different listener instances.  They also need
> > > different callbacks - or at least parameterized callback behaviour
> > > because they do different things (one maps, the other preregs).
> 
> Ugh.  I'm pretty sure we're talking at cross purposes here, but I
> haven't figured out where the disconnect is.  So I'm not sure my
> comments below will be coherent, but let's see.
> 
> > Yep, and Alexey's doing that with a switch statement, which seems like
> > it's helping to share setup w/o being too ugly.
> 
> Which switch are you referring to?  There isn't one in
> vfio_listener_region_add().  I'm talking about the current upstream
> code here, not with Alexey's ddw patches - my whole point here is I
> think the current draft is taking the wrong approach.

I'm not sure how we're going to improve on the current draft or disagree
on its direction, if we're not going to take into account the changes
that it makes, such as the added switch statements.  In my opinion, the
series here is starting to head in a workable direction and with the
suggestions made, I think we can eliminate a lot of the repetitive
wrappers and get good re-use of code.  From my perspective, the comments
here are saying we can't do what we think we're already doing due to
some subtle difference in terminology that I'm unable to translate into
a practical difference in code.  Perhaps you could point out
specifically what doesn't work in the current proposal or at least stub
out with pseudo code the proposal you'd like to see.  Thanks,

Alex

> > Whether it's a "map" or
> > a "pre-reg", we don't really care, it's just an ioctl specific to the
> > vfio-iommu type with some setup.
> 
> So in this case the operation isn't controlled by the IOMMU type, it's
> controlled by which listener instance we're being called from.  spapr
> tce v2 needs both pre-reg and map operations, but it needs them on
> different address spaces.
> 
> Obviously you can reduce things to one set of listener callbacks if
> you parameterize it enough - I'm just not seeing that it will be
> anything but an unnecessary multiplex if we do.
> 
> > Much of that setup is not specific to
> > the vfio-iommu type and can be shared between guest-iommu type,
> > transparent or visible.  That's the part were I do not want spapr
> > specific setup hiding what should be generic code for anybody that has a
> > guest-visible iommu versus guest-transparent iommu.
> 
> Right, I'm trying to achieve the same goal
> 
> > > So I really don't see any sense in which these can be accomplished by
> > > the same listener.  *Maybe* they could share some region walking code,
> > > but I'm not sure there's going to be anything of significant size here.
> > 
> > We're already sharing a listener for type1 and spapr windows, spapr-v2
> > is a lot like type1, just different ioctls and therefore slightly
> > different setup.
> 
> No, it's not diffent ioctl()s, it's *extra* ioctls().  v2 still needs
> all the same map ioctl()s that v1 does, but it *also* needs pre-reg on
> a different set of regions.
> 
> > > > > The more generally correct approach, which allows for more complex
> > > > > IOMMU arrangements and the possibility of new IOMMU types with pre-reg
> > > > > is:
> > > > >    1. Have the core implement both a mapping listener and a pre-reg
> > > > >       listener (optionally enabled by a per-iommu-type flag).
> > > > >       Basically the first one sees what *is* mapped, the second sees
> > > > >       what *could* be mapped.
> > > > 
> > > > This just sounds like different address spaces, address_space_memory vs
> > > > address_space_physical_memory
> > > 
> > > Um.. what?  I'm not even sure what you mean by
> > > address_space_physical_memory (I see no such thing in the source).
> > 
> > It doesn't exist, it's just my attempt to interpret "*is* mapped" vs
> > "*could* be mapped".
> 
> Um.. ok.  So address_space_memory is already physical address space.
> "is mapped" is the PCI address space - that's true by definition on
> all platforms.  The platform difference is just whether the PCI
> address space is the same as address_space_memory or not.
> 
> "could be mapped" is address_space_memory on all current platforms.  I
> can think of theoretical cases where it wouldn't be, but they're
> pretty contrived.
> 
> > > The idea was that the (top level) pre-reg listener would spawn more
> > > listeners for any AS which could get (partially) mapped into the PCI
> > > addres space.
> > > 
> > > But.. I looked a bit closer and realised this scheme doesn't actually
> > > work.  IOMMU memory regions don't actually have a fixed target AS
> > > property (by which I mean the address space the IOMMU translates
> > > *into* rather than translates from - address_space_memory in most
> > > cases).  Instead any individual IOMMU mapping can point to a different
> > > AS supplied in the IOMMUTLB structure.
> > 
> > Why does that imply separate listeners?
> 
> It doesn't, I'm point out a different problem with the scheme I was
> suggesting.   Essentially what 3&4 were supposed to be doing was
> procedurally determining the "could be mapped" AS in a platform
> independent way.
> 
> > We already do this in the
> > current listener.
> 
> No, not quite.  The current listener looks at the IOMMU's source
> address space - the space of IOVAs.  I'm saying here that pre-reg /
> the "could be mapped" path needs to look at the target address space -
> the space which IO PTEs point into.
> 
> > This all sounds like things that are not unique to
> > spapr, only the actual ioctls are unique.
> 
> I'm not suggesting any spapr uniqueness per se here, the only
> difference I'm envisigaing is presence / absence of pre-reg.  In fact
> even the ioctl()s are the same - or would be assuming any future
> pre-reg IOMMU type re-uses the same pre-reg ioctl()s.
> 
> > > > >    2. As now, the mapping listener listens on PCI address space, if
> > > > >       RAM blocks are added, immediately map them into the host IOMMU,
> > > > >       if guest IOMMU blocks appear register a notifier which will
> > > > >       mirror guest IOMMU mappings to the host IOMMU (this is what we
> > > > >       do now).
> > > > 
> > > > Right, this is done now, nothing new required.
> > > 
> > > Yes, I was just spelling that out for comparison with the other part.
> > > 
> > > > >    3. The pre-reg listener also listens on the PCI address space.  RAM
> > > > >       blocks added are pre-registered immediately.  But, if guest
> > > > >       IOMMU blocks are added, instead of registering a guest-iommu
> > > > >       notifier, we register another listener on the *target* AS of the
> > > > >       guest IOMMU, same callbacks as this one.  In practice that
> > > > >       target AS will almost always resolve to address_space_memory,
> > > > >       but this can at least in theory handle crazy guest setups with
> > > > >       multiple layers of IOMMU.
> > > > > 
> > > > >    4. Have to ensure that the pre-reg callbacks always happen before
> > > > >       the mapping calls.  For a system with an IOMMU backend which
> > > > >       requires pre-registration, but doesn't have a guest IOMMU, we
> > > > >       need to pre-reg, then host-iommu-map RAM blocks that appear in
> > > > >       PCI address space.
> > > 
> > > > Both of these just seem like details of the iommu-type (pre-reg)
> > > > specific parts of the listener, I'm not understanding what's
> > > > fundamentally different about it that we can't do it now, within a
> > > > single listener, nor do I see how it's all that different from x86.
> > > 
> > > So, we have two different address spaces to listen on (PCI and
> > > address_space_memory) so we need to different listener instances
> > > registered, clearly.
> > 
> > Yes, Alexey's current proposal has two listener instances, that's fine.
> > 
> > > Those two different listeners need to do different things.  1) maps
> > > memory blocks into VFIO and sets up a notifier to mirror guest IOMMU
> > > mappings into VFIO.  2) preregisters memory blocks, and it would be a
> > > bug if it ever saw a guest IOMMU.
> > > 
> > > So, what's left to share?
> > 
> > Well, 2) is exactly what type1 needs and 1) seems like what any
> > guest-visible IOMMU needs.  Look at the code we have now, there's some
> > small alignment testing, if iommu, else type1.  Maybe the code can be
> > clarified slightly by duplicating the testing into separate listeners,
> > but unless there's something fundamentally different between type1 and
> > spapr-v2, I think that that the vfio-iommu callouts should just be
> > modularized and the listener shared.  Otherwise we're just splitting the
> > code for convenience at the cost of code sharing and maintenance.
> > Thanks,
> > 
> > Alex
> > 
> 






reply via email to

[Prev in Thread] Current Thread [Next in Thread]