From: Alex Williamson
Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu
Date: Tue, 31 Jul 2012 10:56:57 -0600

On Tue, 2012-07-31 at 15:34 +0300, Avi Kivity wrote:
> On 07/31/2012 01:29 AM, Alex Williamson wrote:
> >> 
> >> If the region size is zero, then both memory_region_del_subregion()
> >> (assuming the region is parented) and munmap() do nothing.  So you could
> >> call this unconditionally.
> > 
> > I suppose parenting them is the key.  I'm counting on memory_region_size
> > of zero for an uninitialized, g_malloc0() MemoryRegion.
> 
> That's a no-no.  We have APIs for a reason.  Maybe I'll start encrypting
> the contents by xoring with a private variable.

Heh.  Ok, I thought it was part of the API.  I'll refrain from using
that test on uninitialized MemoryRegions.

> >  Initializing
> > them just to have a parent so we can unconditionally remove them here
> > seems like it's just shifting complexity from one function to another.
> > The majority of BARs aren't even implemented, so we'd actually be
> > setting up a lot of dummy infrastructure for a slightly cleaner unmap
> > function.  I'll keep looking at this, but I'm not optimistic there's an
> > overall simplification here.
> 
> Ok.  But use your own bool, don't overload something from MemoryRegion.

Yup.
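
Something like this is what I have in mind, as a rough sketch (the
field names are illustrative, not necessarily what the next spin of the
patch will use):

    typedef struct VFIOBAR {
        MemoryRegion mem;      /* slow, emulated access */
        MemoryRegion mmap_mem; /* direct mapped subregion */
        void *mmap;
        bool mmap_enabled;     /* our own bool, per the above */
    } VFIOBAR;

    static void vfio_unmap_bar(VFIOBAR *bar)
    {
        if (!bar->mmap_enabled) {
            return;
        }
        memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
        munmap(bar->mmap, memory_region_size(&bar->mmap_mem));
        bar->mmap_enabled = false;
    }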

> >  
> >> >> > +
> >> >> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> >> >> > +        size = memory_region_size(&vdev->msix->mmap_mem);
> >> >> > +        if (size) {
> >> >> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
> >> >> > +            munmap(vdev->msix->mmap, size);
> >> >> > +        }
> >> >> > +    }
> >> > 
> >> > And this one potentially unmaps the overlap after the vector table if
> >> > there's any space for one.
> >> > 
> >> >> Are the three size checks needed? Everything should work without them
> >> >> from the memory core point of view.
> >> > 
> >> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
> >> > NULL... no?
> >> 
> >> NULL isn't the problem (well some kernels protect against mmaping NULL
> >> to avoid kernel exploits), but it seems the kernel doesn't like a zero
> >> length.
> > 
> > in mm/mmap.c:do_munmap() I see:
> > 
> >         if ((len = PAGE_ALIGN(len)) == 0)
> >                 return -EINVAL;
> > 
> > Before anything scary happens, so that should be ok.  It's not really
> > worthwhile to call the munmaps unconditionally though, since we already
> > need the condition tests because the subregions are unparented.
> 
> Yeah.
> 
> > 
> >> >> > +
> >> >> > +    /*
> >> >> > +     * We can't mmap areas overlapping the MSIX vector table, so we
> >> >> > +     * potentially insert a direct-mapped subregion before and after it.
> >> >> > +     */
> >> >> 
> >> >> This splitting is what the memory core really enjoys.  You can just
> >> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
> >> > 
> >> > Sure, but VFIO won't allow us to mmap over the MSI-X table for security
> >> > reasons.  It might be worthwhile to someday make VFIO insert an
> >> > anonymous page over the MSI-X table to allow this, but it didn't look
> >> > trivial for my novice mm abilities.  Easy to add a flag from the VFIO
> >> > kernel structure where we learn about this BAR if we add it in the
> >> > future.
> >> 
> >> I meant do it purely in qemu.  Instead of an emulated region overlaid
> >> by two assigned regions, have an assigned region overlaid by the
> >> emulated region.  The regions seen by the vfio listener will be the same.
> > 
> > Sure, that's what KVM device assignment does, but it requires being able
> > to mmap the whole BAR, including an MSI-X table.  The VFIO kernel side
> > can't assume userspace isn't malicious so it has to prevent this.
> 
> I wonder whether it should prevent the mmap(), or let it through and just
> SIGBUS on accesses.

That's a good idea too, maybe better than just an anonymous page so the
user knows the access doesn't get to hardware.  This is definitely an
improvement I'd like to see as we go.
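
For reference, the splitting we're talking about boils down to something
like this (sketch only; the table offset/size names are approximate, not
lifted from the patch):

    size_t pgsz = getpagesize();
    uint64_t table_off = vdev->msix->table_offset; /* offset within BAR */
    uint64_t lo_size = table_off & ~((uint64_t)pgsz - 1);
    uint64_t hi_off = (table_off + table_size + pgsz - 1) &
                      ~((uint64_t)pgsz - 1);

    if (lo_size) {
        /* direct-map [0, lo_size) below the vector table */
    }
    if (hi_off < bar_size) {
        /* direct-map [hi_off, bar_size) above the vector table */
    }
    /* the table itself stays behind the slow, emulated bar->mem */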

> >> > 
> >> > This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
> >> > an instance of a container in the kernel.  A group can only be attached
> >> > to one container.  So whoever calls us with passed fds needs to track
> >> > this very carefully.  This is also why I've dropped any kind of shared
> >> > IOMMU option to give us a hint whether to try to cram everything in the
> >> > same container (~= iommu domain).  It's too easy to pass conflicting
> >> > info to share a container for one device, but not another... yet they
> >> > may be in the same group.  I'll work on the fd passing though and try to
> >> > come up with a reasonable model.
> >> 
> >> I didn't really follow the container stuff so I can't comment here.  But
> >> suppose all assigned devices are done via fd passing, isn't it
> >> sufficient to just pass the fd for the device (and keep the iommu group
> >> fd in the management tool)?
> > 
> > Nope.
> > 
> > containerfd = open(/dev/vfio/vfio)
> > groupfd = open(/dev/vfio/$GROUPID)
> > devicefd  = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)
> > 
> > The container provides access to the iommu, the group is the unit of
> > ownership and privilege, and a device cannot be accessed without iommu
> > protection.  Therefore to get to a devicefd, we first need to privilege
> > the container by attaching a group to it; that lets us initialize the
> > iommu, which allows us to get the device fd.  At a minimum, we'd need
> > both container and device fds, which means libvirt would be responsible
> > for determining what type of iommu interface to initialize.  Doing that
> > makes adding another device tenuous.  It's not impossible, but VFIO is
> > designed such that /dev/vfio/vfio is completely harmless on its own, safe
> > for mode 0666 access, just like /dev/kvm.  The groupfd is the important
> > access point, so maybe it's sufficient that libvirt could pass only that
> > and let qemu open /dev/vfio/vfio on its own.  The only problem then is
> > that libvirt needs to pass the same groupfd for each device that gets
> > assigned within a group.
> 
> What I was thinking was that libvirt would do all the setup, including
> attaching the iommu, then pass something that is safe to qemu.  I don't
> see an issue with libvirt keeping track of groups; libvirt is supposed
> to be doing the host-side management anyway.  But I'm not familiar with
> the API so I guess it can't be done.  Maybe an extension?

It can be done, I think it will just be challenging to keep qemu and
libvirt in sync.  It means that libvirt has to know which iommu model to
use, when to create new containers, when to re-use old containers, etc.
For each qemu -device vfio-pci we'd need a containerfd and a devicefd
where the containerfd may or may not be the same as used for other
devices.  I've also been thinking that it would be nice if there was a
reset ioctl on the group to make it easier to handle things like p2p
bridges.  That would mean passing a groupfd as well, but then if libvirt
provides that, there's not much point in handing us the device fd
because it's not offering any protection (we can get it ourselves), which
also then makes passing the containerfd awkward because libvirt would
have to assume a 1:1 container:group model.  Passing the devicefd really
never offers protection because we know it's not isolated from other
devices in the group.  I guess I'm back to: libvirt should only pass the
groupfd and let qemu handle the rest.
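
Concretely, given only a groupfd from libvirt, qemu would do roughly the
following (the device address is just an example):

    int container = open("/dev/vfio/vfio", O_RDWR);

    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
        /* unknown kernel API, bail */
    }

    /* the group we were handed empowers the container... */
    ioctl(groupfd, VFIO_GROUP_SET_CONTAINER, &container);

    /* ...which lets us initialize the iommu... */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* ...and only then can we get at the device */
    int device = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");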

> >> >> > +
> >> >> > +
> >> >> > +typedef struct MSIVector {
> >> >> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> >> >> > +    struct VFIODevice *vdev; /* back pointer to device */
> >> >> > +    int vector; /* the vector number for this element */
> >> >> > +    int virq; /* KVM irqchip route for Qemu bypass */
> >> >> 
> >> >> This calls for an abstraction (don't we have a cache where we look those
> >> >> up?)
> >> > 
> >> > I haven't seen one, pointer?  I tried to follow vhost's lead here.
> >> 
> >> See kvm_irqchip_send_msi().  But this isn't integrated with irqfd yet.
> > 
> > Right, the irqfd is what we're really after.
> 
> Ok, I guess both vhost and vfio could use a qemu_irq_eventfd() which
> creates an irqfd if available, or emulates it by adding a listener to
> that eventfd and injecting the interrupt (either through tcg or kvm) itself.

Yeah, that would be a nice simplification.
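
Something along these lines, perhaps (entirely hypothetical, there's no
such helper today, and the kvm calls only approximate the current
irqchip API):

    /*
     * Bind an eventfd to a guest MSI: use an irqfd when KVM supports
     * it, otherwise catch the eventfd in qemu and inject the MSI
     * ourselves.
     */
    static int qemu_irq_eventfd(EventNotifier *e, MSIMessage msg)
    {
        int virq = kvm_irqchip_add_msi_route(kvm_state, msg);

        if (virq >= 0 &&
            kvm_irqchip_add_irqfd(kvm_state,
                                  event_notifier_get_fd(e), virq) == 0) {
            return virq; /* fast path: KVM injects directly */
        }

        /* fallback: handle the eventfd in qemu and inject from there */
        qemu_set_fd_handler(event_notifier_get_fd(e),
                            msi_eventfd_read /* hypothetical handler */,
                            NULL, e);
        return -1;
    }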
  
> >> >> > +    bool use;
> >> >> > +} MSIVector;
> >> >> > +
> >> >> > +
> >> >> > +typedef struct VFIOContainer {
> >> >> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >> >> > +    struct {
> >> >> > +        /* enable abstraction to support various iommu backends */
> >> >> > +        union {
> >> >> > +            MemoryListener listener; /* Used by type1 iommu */
> >> >> > +        };
> >> >> 
> >> >> The usual way is to have a Type1VFIOContainer deriving from
> >> >> VFIOContainer and adding a MemoryListener.
> >> > 
> >> > Yep, that would work too.  It gets a bit more complicated that way
> >> > though because we have to know when the container is allocated what type
> >> > it's going to be.  This way we can step through possible iommu types and
> >> > support the right one.  Eventually there may be more than one type
> >> > supported on the same platform (ex. one that enables PRI).  Do-able, but
> >> > I'm not sure it's worth it at this point.
> >> 
> >> Another alternative is to put a pointer to an abstract type here,
> >> then you can defer the decision on the concrete type later.  But I agree
> >> it's not worth it at this point.  Maybe just drop the union and decide
> >> later when a second iommu type is added.
> > 
> > A pointer doesn't allow us to use container_of to get back to the
> > VFIOContainer from the memory listener callback, so we'd have to create
> > some new struct just to hold that back pointer.  Alexey's proposed POWER
> > support for VFIO already makes use of the union, so it seems like a
> > sufficient solution for now.  We'll have to re-evaluate if it's getting
> > unwieldy after we get a few though.
> 
> Ok.
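
(For completeness, the container_of trick the union buys us looks like
this in the type1 listener callback; sketch based on the layout quoted
above:

    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        /* works because the struct/union above is anonymous */
        VFIOContainer *container = container_of(listener, VFIOContainer,
                                                listener);
        /* ... map section through container->fd ... */
    }

A bare pointer member would lose that back-link, which is the extra
struct I was referring to.)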

Thanks!

Alex