Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration


From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration
Date: Sat, 30 Mar 2019 08:14:07 -0600

On Fri, 29 Mar 2019 19:10:50 -0400
Zhao Yan <address@hidden> wrote:

> On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote:
> > On Thu, 28 Mar 2019 22:47:04 -0400
> > Zhao Yan <address@hidden> wrote:
> >   
> > > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote:  
> > > > On Thu, 28 Mar 2019 10:21:38 +0100
> > > > Erik Skultety <address@hidden> wrote:
> > > >     
> > > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote:    
> > > > > > hi Alex and Dave,
> > > > > > Thanks for your replies.
> > > > > > Please see my comments inline.
> > > > > >
> > > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote:    
> > > > > >   
> > > > > > > On Wed, 27 Mar 2019 20:18:54 +0000
> > > > > > > "Dr. David Alan Gilbert" <address@hidden> wrote:
> > > > > > >      
> > > > > > > > * Zhao Yan (address@hidden) wrote:      
> > > > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > > > > > > > > > >   b) How do we detect if we're migrating from/to 
> > > > > > > > > > > > > > the wrong device or
> > > > > > > > > > > > > > version of device?  Or say to a device with older 
> > > > > > > > > > > > > > firmware or perhaps
> > > > > > > > > > > > > > a device that has less device memory ?      
> > > > > > > > > > > > > > > Actually it's still an open question for VFIO migration. Need
> > > > > > > > > > > > > > > to think about
> > > > > > > > > > > > > > > whether it's better to check that in libvirt or qemu
> > > > > > > > > > > > > > > (like a device magic
> > > > > > > > > > > > > > > along with version?).
> > > > > > > > > > >
> > > > > > > > > > > > We must keep the hardware generation the same within one
> > > > > > > > > > > > POD of public cloud
> > > > > > > > > > > > providers. But we are still thinking about live migration
> > > > > > > > > > > > from a lower
> > > > > > > > > > > > generation of hardware to a higher generation.
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Agreed, lower->higher is the one direction that might make 
> > > > > > > > > > sense to
> > > > > > > > > > support.
> > > > > > > > > >
> > > > > > > > > > But regardless of that, I think we need to make sure that 
> > > > > > > > > > incompatible
> > > > > > > > > > devices/versions fail directly instead of failing in a 
> > > > > > > > > > subtle, hard to
> > > > > > > > > > debug way. Might be useful to do some initial sanity checks 
> > > > > > > > > > in libvirt
> > > > > > > > > > as well.
> > > > > > > > > >
> > > > > > > > > > How easy is it to obtain that information in a form that 
> > > > > > > > > > can be
> > > > > > > > > > consumed by higher layers? Can we find out the device type 
> > > > > > > > > > at least?
> > > > > > > > > > What about some kind of revision?      
> > > > > > > > > hi Alex and Cornelia
> > > > > > > > > for device compatibility, do you think it's a good idea to 
> > > > > > > > > use "version"
> > > > > > > > > and "device version" fields?
> > > > > > > > >
> > > > > > > > > version field: identify live migration interface's version. 
> > > > > > > > > it can have a
> > > > > > > > > sort of backward compatibility, like target machine's version 
> > > > > > > > > >= source
> > > > > > > > > machine's version. something like that.      
> > > > > > >
> > > > > > > Don't we essentially already have this via the device specific 
> > > > > > > region?
> > > > > > > The struct vfio_info_cap_header includes id and version fields, 
> > > > > > > so we
> > > > > > > can declare a migration id and increment the version for any
> > > > > > > incompatible changes to the protocol.      
> > > > > > yes, good idea!
> > > > > > so, what about declaring below new cap?
> > > > > >     #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > > > >     struct vfio_region_info_cap_migration {
> > > > > >         struct vfio_info_cap_header header;
> > > > > >         __u32 device_version_len;
> > > > > >         __u8  device_version[];
> > > > > >     };    
> > > > 
> > > > I'm not sure why we need a new region for everything, it seems this
> > > > could fit within the protocol of a single region.  This could simply be
> > > > a new action to retrieve the version where the protocol would report
> > > > the number of bytes available, just like the migration stream itself.   
> > > >  
> > > so, to get version of VFIO live migration device state interface (simply
> > > call it migration interface?),
> > > a new cap looks like this:
> > > #define VFIO_REGION_INFO_CAP_MIGRATION 4
> > > it contains struct vfio_info_cap_header only.
> > > when get region info of the migration region, we query this cap and get
> > > migration interface's version. right?
> > > 
> > > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine?  
> > 
> > Again, why a new region.  I'm imagining we have one region and this is
> > just asking for a slightly different thing from it.  But TBH, I'm not
> > sure we need it at all vs the sysfs interface.
> >   
> > > > > > > > > > device_version field consists of two parts:
> > > > > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086.      
> > > > > > >
> > > > > > > Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
> > > > > > > suggest we use a bit to flag it as such so we can reserve that 
> > > > > > > portion
> > > > > > > of the 32bit address space.  See for example:
> > > > > > >
> > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE        (1 << 31)
> > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK        (0xffff)
> > > > > > >
> > > > > > > For vendor specific regions.      
> > > > > > Yes, use PCI vendor ID.
> > > > > > you are right, we need to use highest bit 
> > > > > > (VFIO_REGION_TYPE_PCI_VENDOR_TYPE)
> > > > > > to identify it's a PCI ID.
> > > > > > Thanks for pointing it out.
> > > > > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK 
> > > > > > used for?
> > > > > > why it's 0xffff? I searched QEMU and kernel code and did not find 
> > > > > > anywhere
> > > > > > uses it.    
> > > > 
> > > > PCI vendor IDs are 16bits, it's just indicating that when the
> > > > PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits.    
> > > 
> > > thanks:)
> > >   
> > > > > > > > > 2. vendor proprietary string: it can be any string that a 
> > > > > > > > > vendor driver
> > > > > > > > > thinks can identify a source device. e.g. pciid + mdev type.
> > > > > > > > > "vendor id" is to avoid overlap of "vendor proprietary 
> > > > > > > > > string".
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > struct vfio_device_state_ctl {
> > > > > > > > >      __u32 version;            /* ro */
> > > > > > > > >      __u8 device_version[MAX_DEVICE_VERSION_LEN];            
> > > > > > > > > /* ro */
> > > > > > > > >      struct {
> > > > > > > > >       __u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > > > > > > > >       ...
> > > > > > > > >      }data;
> > > > > > > > >      ...
> > > > > > > > >  };      
> > > > > > >
> > > > > > > We have a buffer area where we can read and write data from the 
> > > > > > > vendor
> > > > > > > driver, why would we have this IS_COMPATIBLE protocol to write the
> > > > > > > device version string but use a static fixed length version 
> > > > > > > string in
> > > > > > > the control header to read it?  IOW, let's use GET_VERSION,
> > > > > > > CHECK_VERSION actions that make use of the buffer area and allow 
> > > > > > > vendor
> > > > > > > specific version information length.      
> > > > > > you are right, such static fixed length version string is bad :)
> > > > > > > To get device version, which approach below do you think is better?
> > > > > > 1. use GET_VERSION action, and read from region buffer
> > > > > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION
> > > > > >
> > > > > > seems approach 1 is better, and cap VFIO_REGION_INFO_CAP_MIGRATION 
> > > > > > is only
> > > > > > for checking migration interface's version?    
> > > > 
> > > > I think 1 provides the most flexibility to the vendor driver.    
> > > 
> > > Got it.
> > > > For VFIO live migration, rather than reusing the device state region
> > > > (which takes GET_BUFFER/SET_BUFFER actions),
> > > > could we create a new region for GET_VERSION & CHECK_VERSION?
> > 
> > Why?
> >   
> > > > > > > > > Then, an action IS_COMPATIBLE is added to check device 
> > > > > > > > > compatibility.
> > > > > > > > >
> > > > > > > > > The flow to figure out whether a source device is migratable 
> > > > > > > > > to target device
> > > > > > > > > is like that:
> > > > > > > > > 1. in source side's .save_setup, save source device's 
> > > > > > > > > device_version string
> > > > > > > > > 2. in target side's .load_state, load source device's device 
> > > > > > > > > version string
> > > > > > > > > and write it to data region, and call IS_COMPATIBLE action to 
> > > > > > > > > ask vendor driver
> > > > > > > > > to check whether the source device is compatible to it.
> > > > > > > > >
> > > > > > > > > The advantage of adding an IS_COMPATIBLE action is that, 
> > > > > > > > > vendor driver can
> > > > > > > > > maintain a compatibility table and decide whether source 
> > > > > > > > > device is compatible
> > > > > > > > > to target device according to its proprietary table.
> > > > > > > > > In device_version string, vendor driver only has to describe 
> > > > > > > > > the source
> > > > > > > > > device as elaborately as possible and resorts to vendor 
> > > > > > > > > driver in target side
> > > > > > > > > to figure out whether they are compatible.      
> > > > > > >
> > > > > > > I agree, it's too complicated and restrictive to try to create an
> > > > > > > interface for the user to determine compatibility, let the driver
> > > > > > > declare it compatible or not.      
> > > > > > :)
> > > > > >      
> > > > > > > > It would also be good if the 'IS_COMPATIBLE' was somehow 
> > > > > > > > callable
> > > > > > > > externally - so we could be able to answer a question like 'can 
> > > > > > > > we
> > > > > > > > migrate this VM to this host' - from the management layer 
> > > > > > > > before it
> > > > > > > > actually starts the migration.      
> > > > > >
> > > > > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and 
> > > > > > CHECK_VERSION.
> > > > > > GET_VERSION returns a vm's device's version string.
> > > > > > CHECK_VERSION's input is device version string and return
> > > > > > compatible/non-compatible.
> > > > > > Do you think it's good?    
> > > > 
> > > > That's the idea, but note that QEMU can only provide the QMP interface,
> > > > the sysfs interface would of course be provided as more of a direct
> > > > path from the vendor driver or mdev kernel layer.
> > > >     
> > > > > > > I think we'd need to mirror this capability in sysfs to support 
> > > > > > > that,
> > > > > > > or create a qmp interface through QEMU that the device owner 
> > > > > > > could make
> > > > > > > the request on behalf of the management layer.  Getting access to 
> > > > > > > the
> > > > > > > vfio device requires an iommu context that's already in use by the
> > > > > > > device owner, we have no intention of supporting a model that 
> > > > > > > allows
> > > > > > > independent tasks access to a device.  Thanks,
> > > > > > > Alex
> > > > > > >      
> > > > > > do you think two sysfs nodes under a device node is ok?
> > > > > > e.g.
> > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version
> > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version
> > > > > >       
> > > > 
> > > > > I'd think it might live more in the mdev_supported_types area; wouldn't
> > > > we ideally like to know if a device is compatible even before it's
> > > > created?  For example maybe:
> > > > 
> > > > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version
> > > > 
> > > > > Where reading the sysfs attribute returns the version string and
> > > > > writing a string into the attribute returns an errno for
> > > > > incompatibility.
> > > yes, knowing if a device is compatible before it's created is good.
> > > > but do you think checking whether a device is compatible after it's
> > > > created is also required? For live migration, the user usually only
> > > > queries this information when it's really required, i.e. when a device
> > > > has been created.
> > > > maybe we can add this version get/check at both places?
> > 
> > Why does an instantiated device suddenly not follow the version and
> > compatibility rules of an uninstantiated device?  IOW, if the version
> > and compatibility check are on the mdev type, why can't we trace back
> > from the device to the mdev type and make use of that same interface?
> > Seems the only question is whether we require an interface through the
> > vfio API directly or if sysfs is sufficient.  
> ok. got it.
> 
> > > > > Why do you need both sysfs and QMP at the same time? I can see it 
> > > > > gives us some
> > > > > flexibility, but is there something more to that?
> > > > >
> > > > > Normally, I'd prefer a QMP interface from libvirt's perspective (with 
> > > > > an
> > > > > appropriate capability that libvirt can check for QEMU support) 
> > > > > because I imagine large nodes having a
> > > > > bunch of GPUs with different revisions which might not be backwards 
> > > > > compatible.
> > > > > Libvirt might query the version string on source and check it on dest 
> > > > > via the
> > > > > QMP in a way that QEMU, talking to the driver, would return either a 
> > > > > list or a
> > > > > single physical device to which we can migrate, because neither QEMU 
> > > > > nor
> > > > > > libvirt know that, only the driver does, so that's important
> > > > > > information
> > > > > rather than looping through all the devices and trying to find one 
> > > > > that is
> > > > > compatible. However, you might have a hard time making all the 
> > > > > necessary
> > > > > changes in QMP introspectable, a new command would be fine, but if 
> > > > > you also
> > > > > wanted to extend say vfio-pci options, IIRC that would not appear in 
> > > > > the QAPI
> > > > > schema and libvirt would not be able to detect support for it.
> > > > > 
> > > > > On the other hand, the presence of a QMP interface IMO doesn't help 
> > > > > mgmt apps
> > > > > much, as it still carries the burden of being able to check this only 
> > > > > at the
> > > > > time of migration, which e.g. OpenStack would like to know long 
> > > > > before that.
> > > > > 
> > > > > So, having sysfs attributes would work for both libvirt (even though 
> > > > > libvirt
> > > > > would benefit from a QMP much more) and OpenStack. OpenStack would 
> > > > > IMO then
> > > > > have to figure out how to create the mappings between compatible 
> > > > > devices across
> > > > > several nodes which are non-uniform.    
> > > > 
> > > > Yep, vfio encompasses more than just QEMU, so a sysfs interface has more
> > > > utility than a QMP interface.  For instance we couldn't predetermine if
> > > > an mdev type on a host is compatible if we need to first create the
> > > > device and launch a QEMU instance on it to get access to QMP.  So maybe
> > > > the question is whether we should bother with any sort of VFIO API to
> > > > do this comparison, perhaps only a sysfs interface is sufficient for a
> > > > complete solution.  The downside of not having a version API in the
> > > > user interface might be that QEMU on its own can only try a migration
> > > > and see if it fails, it wouldn't have the ability to test expected
> > > > compatibility without access to sysfs.  And maybe that's fine.  Thanks,
> > > >     
> > > > So QEMU vfio uses sysfs to check device compatibility in migration's
> > > > save_setup phase?
> > 
> > > The migration stream between source and target device is the ultimate
> > test of compatibility, the vendor driver should never rely on userspace
> > validating compatibility of the migration.  At the point it could do so, the
> > migration has already begun, so we're only testing how quickly we can
> > fail the migration.  The management layer setting up the migration can
> > test via sysfs for compatibility and the migration stream itself needs
> > to be self validating, so what value is added for QEMU to perform a
> > version compatibility test?  Thanks,  
> oh, do you mean vendor driver should embed source device's version in 
> migration
> stream, which is opaque to qemu?
> otherwise, I can't think of a quick way for vendor driver to determine whether
> source device is an incompatible device.  

Yes, the vendor driver cannot rely on the user to make sure the
incoming migration stream is compatible, the vendor driver must take
responsibility for this.  Therefore, regardless of what other
interfaces we have for the user to test the compatibility between
devices, the vendor driver must make no assumptions about the validity
or integrity of the data stream.  Plan for and protect against a
malicious or incompetent user.  Thanks,

Alex
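
To make the point about a self-validating stream concrete, here is a minimal
sketch of how a vendor driver might check an incoming stream before trusting
any of it. Everything in it is an assumption for illustration: the header
layout, the names (mig_stream_header, mig_stream_build, mig_stream_validate),
the magic value and the toy checksum are hypothetical and not part of any
VFIO interface discussed in this thread. The vendor ID encoding only mirrors
the VFIO_REGION_TYPE_PCI_VENDOR_TYPE / VFIO_REGION_TYPE_PCI_VENDOR_MASK
convention mentioned above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants for this sketch; not part of the VFIO uAPI. */
#define MIG_STREAM_MAGIC    0x4d494756u  /* arbitrary marker */
#define MIG_STREAM_VERSION  1u           /* newest format this side understands */
#define MIG_ID_PCI_VENDOR   (1u << 31)   /* flag: low 16 bits are a PCI vendor ID */
#define MIG_ID_VENDOR_MASK  0xffffu

/* Hypothetical header the source driver prepends to its opaque device state.
 * Fields are host-endian here; a real format would fix the byte order. */
struct mig_stream_header {
    uint32_t magic;
    uint32_t version;
    uint32_t vendor_id;    /* MIG_ID_PCI_VENDOR | PCI vendor ID */
    uint32_t payload_len;  /* bytes following the header */
    uint32_t checksum;     /* over the payload only */
};

/* Toy checksum: catches accidental corruption, not deliberate tampering. */
static uint32_t mig_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + p[i];
    return sum;
}

/* Source side: write header + payload into buf. Returns bytes used, 0 if
 * the buffer is too small. */
size_t mig_stream_build(void *buf, size_t cap, uint16_t vendor,
                        const void *payload, uint32_t payload_len)
{
    struct mig_stream_header h = {
        .magic = MIG_STREAM_MAGIC,
        .version = MIG_STREAM_VERSION,
        .vendor_id = MIG_ID_PCI_VENDOR | vendor,
        .payload_len = payload_len,
        .checksum = mig_checksum(payload, payload_len),
    };

    if (cap < sizeof(h) || payload_len > cap - sizeof(h))
        return 0;
    memcpy(buf, &h, sizeof(h));
    memcpy((uint8_t *)buf + sizeof(h), payload, payload_len);
    return sizeof(h) + payload_len;
}

/* Target side: trust nothing until every field has been checked.
 * Returns 0 if acceptable, a negative errno otherwise. */
int mig_stream_validate(const void *buf, size_t len, uint16_t local_vendor)
{
    struct mig_stream_header h;

    if (buf == NULL || len < sizeof(h))
        return -EINVAL;
    memcpy(&h, buf, sizeof(h));  /* copy out: buf may be unaligned */

    if (h.magic != MIG_STREAM_MAGIC)
        return -EINVAL;
    /* Accept same or older formats only (lower -> higher direction). */
    if (h.version == 0 || h.version > MIG_STREAM_VERSION)
        return -EINVAL;
    /* Flag bit must be set, no stray bits, vendor must match this device. */
    if (!(h.vendor_id & MIG_ID_PCI_VENDOR) ||
        (h.vendor_id & ~(MIG_ID_PCI_VENDOR | MIG_ID_VENDOR_MASK)) ||
        (h.vendor_id & MIG_ID_VENDOR_MASK) != local_vendor)
        return -ENODEV;
    /* Declared length must match what actually arrived. */
    if (h.payload_len != len - sizeof(h))
        return -EINVAL;
    if (mig_checksum((const uint8_t *)buf + sizeof(h), h.payload_len) !=
        h.checksum)
        return -EINVAL;
    return 0;
}
```

The ordering is the point of the sketch: size first, then identity, then
length, then content, so that no field is used before it has been bounded.
A checksum like this only catches corruption; protecting against a malicious
user would additionally require validating every field inside the payload.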

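For the management-layer side, the sysfs "version" attribute floated earlier
in the thread (read returns the version string, write returns an errno on
incompatibility) could be exercised from userspace roughly as below. This is
only a sketch of that proposal: the attribute did not exist at the time of
this thread, and the helper names mdev_version_read and mdev_version_check
are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the version string from the source host's attribute.
 * Returns the string length, or a negative errno. */
ssize_t mdev_version_read(const char *attr_path, char *buf, size_t cap)
{
    if (cap == 0)
        return -EINVAL;

    int fd = open(attr_path, O_RDONLY);
    if (fd < 0)
        return -errno;

    ssize_t n = read(fd, buf, cap - 1);
    int err = errno;  /* save before close() can overwrite it */
    close(fd);
    if (n < 0)
        return -err;

    while (n > 0 && buf[n - 1] == '\n')  /* sysfs reads often end in '\n' */
        n--;
    buf[n] = '\0';
    return n;
}

/* Write the source's version string into the target host's attribute.
 * Under the proposal, the vendor driver fails the write with an errno when
 * the two are incompatible. Returns 0 if accepted, a negative errno if not. */
int mdev_version_check(const char *attr_path, const char *version)
{
    int fd = open(attr_path, O_WRONLY);
    if (fd < 0)
        return -errno;

    size_t len = strlen(version);
    ssize_t n = write(fd, version, len);
    int err = errno;
    close(fd);

    if (n < 0)
        return -err;
    return (size_t)n == len ? 0 : -EIO;
}
```

A management tool would call mdev_version_read() on the source host, ship the
string to the target host, and call mdev_version_check() there before ever
starting QEMU. As noted above, a passing check is advisory only; the vendor
driver still has to validate the stream itself.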

