
From: Alex Williamson
Subject: Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
Date: Tue, 2 Jun 2020 09:06:29 -0600

On Wed, 20 May 2020 17:45:13 -0700
John G Johnson <john.g.johnson@oracle.com> wrote:

> > I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
> > The former seems intended to provide the server with access to the
> > entire GPA space, while the latter indicates an IOVA to GPA mapping of
> > those regions.  Doesn't this break the basic isolation of a vIOMMU?
> > This essentially says to me "here's all the guest memory, but please
> > only access these regions for which we're providing DMA mappings".
> > That invites abuse.
> >   
> 
>       The purpose behind separating QEMU into multiple processes is
> to provide an additional layer of protection for the infrastructure
> against a malign guest, not for the guest against itself, so preventing
> a server that has been compromised by a guest from accessing all of
> guest memory adds no additional benefit.  We don’t even have an IOMMU
> in our current guest model for this reason.

One of the use cases we see a lot with vfio is nested assignment, i.e.
we assign a device to a VM where the VM includes a vIOMMU, such that
the guest OS can then assign the device to userspace within the guest.
This is safe to do AND provides isolation within the guest exactly
because the device only has access to memory mapped to the device, not
the entire guest address space.  I don't think it's just the hypervisor
you're trying to protect; we can't assume there are always trusted
drivers managing the device.

> 
>       The implementation was stolen from vhost-user, with the exception
> that we push IOTLB translations from client to server like VFIO does, as
> opposed to pulling them from server to client like vhost-user does.

It seems that vhost has numerous hacks as a result, forcing it to know
whether a vIOMMU is present; vfio has none.
 
>       That said, neither the qemu-mp nor MUSER implementation uses an
> IOMMU, so if you prefer another IOMMU model, we can consider it.  We
> could only send the guest memory file descriptors with IOMMU_MAP_DMA
> requests, although that would cost performance since each request would
> require the server to execute an mmap() system call.

It would seem shortsighted to not fully enable a vIOMMU compatible
implementation at this time.
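
For reference, the per-request work on the server side would amount to a
single mmap() of the file descriptor attached to the map message, along
these lines (a rough sketch only; the handler and parameter names are
made up, not taken from the spec):

  #include <stdint.h>
  #include <sys/mman.h>

  /* Map the guest memory fd attached to a (hypothetical) map request. */
  void *server_map_region(int region_fd, uint64_t fd_offset, uint64_t size)
  {
      void *vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                         region_fd, fd_offset);

      if (vaddr == MAP_FAILED) {
          return NULL;    /* reply to the client with an error */
      }
      /* record the translation so later DMA accesses hit this mapping */
      return vaddr;
  }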

> > Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
> > "an array of file descriptors will be sent as part of the message
> > meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
> > the table specified as 0?  How does a client learn their Device ID?
> >   
> 
>       SCM_RIGHTS control messages allow sendmsg() to send an array of
> file descriptors over a UNIX domain socket.
> 
>       We’re only supporting one device per socket in this protocol
> version, so the device ID will always be 0.  This may change in a future
> revision, so we included the field in the header to avoid a major version
> change if device multiplexing is added later.
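
On the fd passing, for anyone following along who hasn't used
SCM_RIGHTS: attaching an array of descriptors to a message is just
ancillary data on sendmsg(), roughly like this (a minimal sketch, not
code from the proposal):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Send @len bytes from @buf plus @nfds descriptors (assumes nfds <= 8). */
  static ssize_t send_with_fds(int sock, const void *buf, size_t len,
                               const int *fds, int nfds)
  {
      char cbuf[CMSG_SPACE(8 * sizeof(int))];
      struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
      struct msghdr msg = {
          .msg_iov = &iov,
          .msg_iovlen = 1,
          .msg_control = cbuf,
          .msg_controllen = CMSG_SPACE(nfds * sizeof(int)),
      };
      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
      memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));

      return sendmsg(sock, &msg, 0);
  }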
> 
> 
> > VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
> > capability chain), the cap_offset and next pointers within the chain
> > need to specify what their offset is relative to (ie. the start of the
> > packet, the start of the vfio compatible data structure, etc).  I
> > assume the latter for client compatibility.
> >   
> 
>       Yes.  We will attempt to make the language clearer.
> 
> 
> > Also on REGION_INFO, offset is specified as "the base offset to be
> > given to the mmap() call for regions with the MMAP attribute".  Base
> > offset from what?  Is the mmap performed on the socket fd?  Do we not
> > allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
> > Why do we specify "MMIO" in those operations versus simply "REGION"?
> > Are we arbitrarily excluding support for I/O port regions or device
> > specific regions?  If these commands replace direct read and write to
> > an fd offset, how is PCI config space handled?
> >   
> 
>       The base offset applies to the sparse areas: each sparse area's
> offset is added to the base region offset.  We will try to make the text
> clearer here as well.
> 
>       MMIO was added to distinguish these operations from DMA operations.
> I can see how this can cause confusion when the region refers to a port range,
> so we can change the name to REGION_READ/WRITE. 
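
On the sparse areas, for comparison the existing kernel capability
describes each area by its offset within the region, so the mmap()
offset is the region offset plus the area offset (structs quoted from
linux/vfio.h for reference):

  struct vfio_region_sparse_mmap_area {
          __u64   offset; /* Offset of mmap'able area within region */
          __u64   size;   /* Size of mmap'able area */
  };

  struct vfio_region_info_cap_sparse_mmap {
          struct vfio_info_cap_header header;
          __u32   nr_areas;
          __u32   reserved;
          struct vfio_region_sparse_mmap_area areas[];
  };

The spec should be equally explicit about what the offset is relative
to when the mmap() target is an fd passed over the socket.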
> 
> 
> > VFIO_USER_MMIO_READ specifies the count field is zero and the reply
> > will include the count specifying the amount of data read.  How does
> > the client specify how much data to read?  Via message size?
> >   
> 
>       This is a bug in the doc.  As you said, the count field should
> specify the amount of data to be read.
>       
> 
> > VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
> > should only ever have access via IOVA, which implies a DMA mapping
> > exists for the device.  Can you provide an example of why we need these
> > commands since there seems little point to this interface if a device
> > cannot directly interact with VM memory.
> >   
> 
>       It is a GPA.  The device emulation code would only handle the DMA
> addresses the guest programmed it with; the server infrastructure knows
> whether an IOMMU exists, and whether the DMA address needs translation to
> GPA or not.

I'll reiterate: a device should only ever issue DMAs in terms of IOVA.
This is how vfio works.
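
That's also the only model the existing type1 uAPI supports: userspace
maps a range of its own virtual address space at an IOVA of its
choosing, and the device can touch nothing outside those mappings.
Roughly (a sketch of the existing kernel interface, error handling
omitted):

  #include <linux/vfio.h>
  #include <stdint.h>
  #include <sys/ioctl.h>

  /* Expose one range of our memory to the device at a chosen IOVA. */
  static int map_one_range(int container_fd, void *vaddr, uint64_t iova,
                           uint64_t size)
  {
      struct vfio_iommu_type1_dma_map dma_map = {
          .argsz = sizeof(dma_map),
          .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
          .vaddr = (uintptr_t)vaddr,  /* userspace virtual address */
          .iova  = iova,              /* the only address the device may use */
          .size  = size,
      };

      /* Until this succeeds, the device cannot touch [iova, iova + size). */
      return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
  }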

> > The IOMMU commands should be unnecessary, a vIOMMU should be
> > transparent to the server by virtue that the device only knows about
> > IOVA mappings accessible to the device.  Requiring the client to expose
> > all memory to the server implies that the server must always be trusted.
> >   
> 
>       The client and server are equally trusted; the guest is the untrusted
> entity.

And therefore the driver is untrusted, and opening the client/server
window to expose all of guest memory presents a larger attack surface.

> > Interrupt info format, s/type/index/, s/vector/subindex/
> >   
> 
>       ok
> 
> 
> > In addition to the unused ioctls, the entire concept of groups and
> > containers is not found in this specification.  To some degree that
> > makes sense; even mdevs, and typically SR-IOV VFs, have a 1:1
> > device-to-group relationship.  However, the container is very much
> > involved in
> > the development of migration support, where it's the container that
> > provides dirty bitmaps.  Since we're doing map and unmap without that
> > container concept here, perhaps we'd equally apply those APIs to this
> > same socket.  Thanks,  
> 
>       Groups and containers are host IOMMU concepts, and we don’t
> interact with the host here.  The kernel VFIO driver doesn’t even need
> to exist for VFIO over socket.  I think it’s fine to assume a 1-1
> correspondence between containers, groups, and a VFIO over socket device.
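
For reference, the concepts in question look roughly like this on the
host side today (a sketch of existing usage; the group number and
device name are just examples):

  #include <fcntl.h>
  #include <linux/vfio.h>
  #include <sys/ioctl.h>

  /* Sketch of the existing host-side object model (error handling omitted). */
  static int open_device(void)
  {
      int container = open("/dev/vfio/vfio", O_RDWR);
      int group = open("/dev/vfio/26", O_RDWR);      /* example group number */

      ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
      ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

      /* DMA mapping -- and the dirty bitmaps being developed for migration --
       * are operations on the container, not on the device fd. */

      return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
  }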

Obviously the kernel driver and host IOMMU are out of the picture here.
The point I was trying to make is that we're building interfaces to
support migration around concepts that don't exist in this model, so
it's not clear how we'd map, for example, dirty page tracking on the
container interface to this API.  This seems more akin to the no-iommu
model of vfio, which is a hack where we allow userspace to have access
to a device using the vfio API, but they're on their own for DMA.  We
don't support that model in QEMU, and without those conceptual
equivalencies, I wonder how much we'll be able to leverage existing
QEMU code or co-develop and support future features.  IOW, is this
really just "a vfio-like device model over unix socket" rather than
"vfio over unix socket"?  Thanks,

Alex



