Re: VFIO Migration

qemu-devel

From:	Stefan Hajnoczi
Subject:	Re: VFIO Migration
Date:	Tue, 3 Nov 2020 11:03:24 +0000
On Mon, Nov 02, 2020 at 12:38:23PM -0700, Alex Williamson wrote:
> 
> Cc+ Intel folks as this really bumps into the migration compatibility
> discussion[1][2][3]
> 
> On Mon, 2 Nov 2020 11:11:53 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > There is discussion about VFIO migration in the "Re: Out-of-Process
> > Device Emulation session at KVM Forum 2020" thread. The current status
> > is that Kirti proposed a VFIO device region type for saving and loading
> > device state. There is currently no guidance on migrating between
> > different device versions or device implementations from different
> > vendors. This is known to be non-trivial and raised discussion about
> > whether it should really be handled by VFIO or centralized in QEMU.
> > 
> > Below is a document that describes how to ensure migration compatibility
> > in VFIO. It does not require changes to the VFIO migration interface. It
> > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> > 
> > The idea is that the device state blob is opaque to the VMM but the same
> > level of migration compatibility that exists today is still available.
> > 
> > I hope this will help us reach consensus and let us discuss specifics.
> > 
> > If you followed the previous discussion, I changed the approach from
> > sending a magic constant in the device state blob to identifying device
> > models by URIs. Therefore the device state structure does not need to be
> > defined here - the critical information for ensuring device migration
> > compatibility is the device model and configuration defined below.
> > 
> > Stefan
> > ---
> > VFIO Migration
> > ==============
> > This document describes how to save and load VFIO device states. Saving a
> > device state produces a snapshot of a VFIO device's state that can be loaded
> > again at a later point in time to resume the device from the snapshot.
> > 
> > The data representation of the device state is outside the scope of this
> > document.
> > 
> > Overview
> > --------
> > The purpose of device states is to save the device at a point in time and
> > then restore the device back to the saved state later. This is more
> > challenging than it first appears.
> > 
> > The process of saving a device state and loading it later is called
> > *migration*. The state may be loaded by the same device that saved it or by
> > a new instance of the device, possibly running on a different computer.
> > 
> > It must be possible to migrate to a newer implementation of the device
> > as well as to an older implementation of the device. This allows users
> > to upgrade and roll back their systems.
> 
> 
> It must be possible to specify, but we can't necessarily force a vendor
> to support it.

The wording is unclear. This migration scheme makes it possible but does
not require implementations to support advanced migration scenarios. A
VFIO/mdev driver or vfio-user device backend can refuse to instantiate
with certain device configuration parameters. For example, if version=1
is no longer supported in the latest device implementation then it can
return an error.

> It must also be possible to describe incompatibilities,
> whether due to lack of support or forks in the migration format.

Compatibility is handled by the device model and configuration
parameters that are used to instantiate devices. If device model URIs
differ then the devices are incompatible (e.g.
https://vendor-a.com/rtl8139 and https://vendor-b.com/rtl8139). When
changes are made to the guest-visible hardware interface or device state
representation then they are toggled via configuration parameters (e.g.
rss=on|off).

Here is an example:

The device model is a network card as defined by the
https://vendor-a.com/my-nic device model. Receive Side Scaling (RSS) is
an optional feature and the configuration parameter rss=on|off toggles
its availability. When rss=on the RSS feature is available in the
hardware interface, but it doesn't necessarily mean that the guest
driver has to enable the feature.

Now we wish to migrate to another implementation of the same device
model. On the destination machine RSS is not available, so trying to
instantiate https://vendor-a.com/my-nic with rss=on will fail with an
error because the feature is unavailable (this could be because the
implementation doesn't support the feature or because the host lacks the
capability).

The following combinations are possible:

Source    Available   Result
          on Dest?
-------------------------------------------------------------------
rss=off          no   OK.
rss=off         yes   OK. rss=on is supported but we don't need it.
rss=on           no   FAIL. rss=on is not supported on destination!
rss=on          yes   OK.

By the way, this shows why this scheme is a conservative bound on
migration compatibility. If the guest driver hasn't enabled RSS and
won't be using it then we could potentially migrate rss=on even when the
destination does not support rss=on. But doing this reliably isn't
tractable so instead we use strict migration compatibility.
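
To make the strict rule concrete, here is a minimal sketch of the table
above as a single predicate. The function name is made up for illustration
and is not part of any VFIO interface. Note that it only looks at the
configuration and never at what the guest driver has enabled:

```c
#include <stdbool.h>

/*
 * Hypothetical helper: can a destination accept the source's rss setting?
 * Deliberately ignores whether the guest driver actually enabled RSS,
 * which is what makes the bound conservative.
 */
static bool rss_config_accepted(bool source_rss_on, bool dest_rss_available)
{
    /* rss=off always loads; rss=on needs the feature on the destination */
    return !source_rss_on || dest_rss_available;
}
```

The only failing combination is rss=on on the source with RSS unavailable
on the destination, matching the FAIL row in the table.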

Regarding forking, if you want complete freedom you can pick a new
device model URI. Device instances using the old device model URI are
not considered compatible with the new device model URI. However, you
can then introduce changes to the hardware interface or device state
representation without agreement from the owner of the old device model
URI.

If instead you want to collaborate you can agree on changes with the
device model URI owner. You can change the device's hardware interface
and device state representation as described in this document.
Basically, each change must be reflected in a device configuration
parameter.

> > Migration can fail if loading the device state is not possible. It should
> > fail early with a clear error message. It must not appear to complete but
> > leave the device inoperable due to a migration problem.
> > 
> > The rest of this document describes how these requirements can be met.
> > 
> > Device Models
> > -------------
> > Devices have a *hardware interface* consisting of hardware registers,
> > interrupts, and so on.
> > 
> > The hardware interface together with the device state representation is
> > called a *device model*. Device models can be assigned URIs such as
> > https://qemu.org/devices/e1000e to uniquely identify them.
> > 
> > Multiple implementations of a device model may exist. They are
> > interchangeable if they follow the same hardware interface and device
> > state representation.
> > 
> > Multiple implementations of the same hardware interface may exist with
> > different device state representations, in which case the device models are
> > not interchangeable and must be assigned different URIs.
> > 
> > Migration is only possible when the same device model is supported by the
> > *source* and the *destination* devices.
> > 
> > Device Configuration
> > --------------------
> > Device models may have parameters that affect the hardware interface or
> > device state representation. For example, a network card may have a
> > configurable address filtering table size parameter called
> > ``rx-filter-size``. A device state saved with ``rx-filter-size=32`` cannot
> > be safely loaded into a device with ``rx-filter-size=0``, because changing
> > the size from 32 to 0 may disrupt device operation.
> > 
> > A list of configuration parameters is called the *device configuration*.
> > Migration is expected to succeed when the same device model and
> > configuration that was used for saving the device state is used again to
> > load it.
> > 
> > Note that not all parameters used to instantiate a device need to be part of
> > the device configuration. For example, assigning a network card to a
> > specific physical port is not part of the device configuration since it is
> > not part of the device's hardware interface or the device state
> > representation. The device state can be loaded and run on a different
> > physical port without affecting the operation of the device. Therefore the
> > physical port is not part of the device configuration.
> > 
> > However, secondary aspects related to the physical port may affect the
> > device's hardware interface and need to be reflected in the device
> > configuration. The link speed may depend on the physical port and be
> > reported through the device's hardware interface. In that case a
> > ``link-speed`` configuration parameter is required to prevent unexpected
> > changes to the link speed after migration.
> > 
> > Note that the device configuration is a conservative bound on device
> > states that can be migrated successfully since not all configuration
> > parameters may be strictly required to match on the source and
> > destination devices. For example, if the device's hardware interface has
> > not yet been initialized then changes to the link speed may not be
> > noticed. However, accurately representing runtime constraints is complex
> > and risks introducing migration bugs, so no attempt is made to support
> > them to achieve more relaxed bounds on successful migrations.
> > 
> > Device Versions
> > ---------------
> > As a device evolves, the number of configuration parameters required may
> > become inconvenient for users to express in full. A device configuration can
> > be aliased by a *device version*, which is a shorthand for the full device
> > configuration. This makes it easy to apply a standard device configuration
> > without listing every configuration parameter explicitly.
> > 
> > For example, if address filtering support was added to a network card then
> > device versions and the corresponding configurations may look like this:
> > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > * ``version=2`` - ``rx-filter-size=32``
> > 
> > Device States
> > -------------
> > The details of the device state representation are not covered in this
> > document but the general requirements are discussed here.
> > 
> > The device state consists of data accessible through the device's hardware
> > interface and internal state that is needed to restore device operation.
> > State in the hardware interface includes the values of hardware registers.
> > An example of internal state is an index value needed to avoid processing
> > queued requests more than once.
> > 
> > Changes can be made to the device state representation as follows. Each
> > change to device state must have a corresponding device configuration
> > parameter that allows the change to be toggled:
> > 
> > * When the parameter is disabled the hardware interface and device state
> >   representation are unchanged. This allows old device states to be loaded.
> > 
> > * When the parameter is enabled the change comes into effect.
> > 
> > * The parameter's default value disables the change. Therefore old versions
> >   do not have to explicitly specify the parameter.
> > 
> > The following example illustrates migration from an old device
> > implementation to a new one. A version=1 network card is migrated to a
> > new device implementation that is also capable of version=2 and adds the
> > rx-filter-size=32 parameter. The new device is instantiated with
> > version=1, which disables rx-filter-size and is capable of loading the
> > version=1 device state. The migration completes successfully but note
> > the device is still operating at version=1 level in the new device.
> > 
> > The following example illustrates migration from a new device
> > implementation back to an older one. The new device implementation
> > supports version=1 and version=2. The old device implementation supports
> > version=1 only. Therefore the device can only be migrated when
> > instantiated with version=1 or the equivalent full configuration
> > parameters.
> > 
> > Orchestrating Migrations
> > ------------------------
> > The following steps must be followed to migrate devices:
> > 
> > 1. Check that the source and destination devices support the same device
> >    model.
> > 
> > 2. Check that the destination device supports the source device's
> >    configuration. Each configuration parameter must be accepted by the
> >    destination in order to ensure that it will be possible to load the
> >    device state.
> > 
> > 3. The device state is saved on the source and loaded on the destination.
> > 
> > 4. If migration succeeds then the destination resumes operation and the
> >    source must not resume operation. If the migration fails then the source
> >    resumes operation and the destination must not resume operation.
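
The four steps quoted above could be sketched roughly as follows. This is
only an illustration of the ordering and of the "exactly one side resumes"
rule. The struct, enum, and function names are all hypothetical and the
actual checks are reduced to booleans supplied by the caller:

```c
#include <stdbool.h>
#include <stddef.h>

/* Outcome of each step, filled in by the (hypothetical) management tool */
struct migration_checks {
    bool same_model;      /* step 1: device model URIs match */
    bool config_accepted; /* step 2: destination accepts the configuration */
    bool state_loaded;    /* step 3: state saved on source, loaded on dest */
};

enum resume_side { RESUME_SOURCE, RESUME_DESTINATION };

/* Returns NULL on success, else a message naming the first failed step */
static const char *migration_error(const struct migration_checks *c)
{
    if (!c->same_model)
        return "device model not supported by destination";
    if (!c->config_accepted)
        return "device configuration rejected by destination";
    if (!c->state_loaded)
        return "loading device state failed";
    return NULL;
}

/* Step 4: exactly one side resumes, never both */
static enum resume_side migration_resume_side(const struct migration_checks *c)
{
    return migration_error(c) == NULL ? RESUME_DESTINATION : RESUME_SOURCE;
}
```

Checking the model and configuration before any state is transferred is
what allows the failure to be reported early with a clear message.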
> > 
> > VFIO Implementation
> > -------------------
> > The following applies both to kernel VFIO/mdev drivers and vfio-user device
> > backends.
> > 
> > Devices are instantiated based on a version and/or configuration parameters:
> > * ``version=1`` - use the device configuration aliased by version 1
> > * ``version=2,rx-filter-size=64`` - use version 2 and override
> >   ``rx-filter-size``
> > * ``rx-filter-size=0`` - directly set configuration parameters without
> >   using a version
> > 
> > Device creation fails if the version and/or configuration parameters are not
> > supported.
> > 
> > There must be a mechanism to query the "latest" configuration for a device
> > model. It may simply report ``version=5`` where 5 is the latest version
> > but it could also report all configuration parameters instead of using a
> > version alias.
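
To illustrate the instantiation rules quoted above, here is a minimal
sketch of how a device implementation might expand a version alias and then
apply an explicit override. The struct and function names are hypothetical.
The version-to-parameter mapping is the one from the Device Versions
example (version 1 behaves as rx-filter-size=0, version 2 as
rx-filter-size=32):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resolved configuration for the example network card */
struct nic_config {
    unsigned int version;
    unsigned int rx_filter_size;
};

/*
 * Expand a version alias into its configuration, then apply an optional
 * explicit override (pass NULL for none). Returns 0 on success and -1 if
 * the version is not supported, in which case *out is left untouched.
 */
static int nic_config_resolve(unsigned int version,
                              const unsigned int *rx_filter_override,
                              struct nic_config *out)
{
    unsigned int rx_filter_size;

    assert(out != NULL);

    switch (version) {
    case 1:
        rx_filter_size = 0;
        break;
    case 2:
        rx_filter_size = 32;
        break;
    default:
        return -1; /* device creation fails */
    }

    if (rx_filter_override)
        rx_filter_size = *rx_filter_override;

    out->version = version;
    out->rx_filter_size = rx_filter_size;
    return 0;
}
```

A real implementation would also have to validate the override itself (for
example, reject sizes the hardware cannot provide). That is left out here.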
> 
> When we talk about "instantiating" a device here, are we referring to
> managing the device on the host or within QEMU via something like
> vfio_realize()?  We create an instance of an mdev on the host via an
> mdev type using operations on the host sysfs.  That mdev type doesn't
> really seem to map to your idea of a device model represented by a URI.
> How are supported URIs exposed and specified when the device is
> instantiated?
> 
> Same for device configuration, we might have per device attributes in
> host sysfs defining the configuration of a given mdev device, are these
> the device configuration values?  It seems like you're referring to
> something much more QEMU centric, but vfio-pci in QEMU handles all
> devices the same, aside from quirks.
> 
> Likewise, I don't know where versions would be exposed in the current
> vfio interface.

"Instantiating" means writing to the mdev "create" sysfs attr. I am not
very familiar with mdev so this could be totally wrong, but I'll try to
define a mapping:

1. The mdev driver sets up struct
   mdev_parent_ops->supported_type_groups as follows:

  /* Device model URI */
  static ssize_t model_show(struct kobject *kobj,
                            struct device *dev,
                            char *buf)
  {
      return sprintf(buf, "https://vendor-a.com/my-nic\n");
  }
  static MDEV_TYPE_ATTR_RO(model);

  /* Receive Side Scaling (RSS) */
  static ssize_t rss_show(struct kobject *kobj,
                          struct device *dev,
                          char *buf)
  {
      return sprintf(buf, "%d\n", ...->rss);
  }
  static ssize_t rss_store(struct kobject *kobj,
                           struct device *dev,
                           const char *page,
                           size_t count)
  {
      char *p = (char *) page;
      unsigned long val = simple_strtoul(p, &p, 10);

      /* Any non-zero value turns RSS on */
      ...->rss = !!val;
      return count;
  }
  static MDEV_TYPE_ATTR_RW(rss);

  /* Device version */
  static ssize_t version_show(struct kobject *kobj,
                              struct device *dev,
                              char *buf)
  {
      return sprintf(buf, "%u\n", ...->version);
  }
  static ssize_t version_store(struct kobject *kobj,
                               struct device *dev,
                               const char *page,
                               size_t count)
  {
      char *p = (char *) page;
      unsigned long val = simple_strtoul(p, &p, 10);

      /* Set device configuration parameters to their defaults */
      switch (val) {
      case 1:
          ...->rss = false;
          ...->version = 1;
          break;

      case 2:
          ...->rss = true;
          ...->version = 2;
          break;

      default:
          return -ENOTSUPP;
      }

      return count;
  }
  static MDEV_TYPE_ATTR_RW(version);

  static struct attribute *mdev_type_my_nic_attrs[] = {
      &mdev_type_attr_model.attr,
      &mdev_type_attr_rss.attr,
      &mdev_type_attr_version.attr,
      NULL,
  };

  static struct attribute_group mdev_type_group_my_nic = {
      .name  = "my-nic", /* shorthand name */
      .attrs = mdev_type_my_nic_attrs,
  };

  struct attribute_group *supported_type_groups[] = {
      &mdev_type_group_my_nic,
      NULL,
  };

2. The userspace tooling enumerates supported device models by reading
   the "model" sysfs attr from each supported type attr group.

3. Userspace picks the device model it wishes to instantiate and sets
   the "version" sysfs attr and other device configuration parameters as
   desired.

4. Userspace instantiates the device by writing to the mdev "create" sysfs
   attr. If instantiation succeeds then migrating a device state saved
   by the same device model with the same configuration parameters is
   possible.
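
Steps 2 to 4 could look roughly like this from userspace. This is only a
sketch: the "model" and "version" attrs are the ones proposed above and do
not exist in mdev today, the helper names are made up, and type_dir stands
for the directory of one supported type group. Only the "create" attr,
which takes a UUID, is existing mdev behavior:

```c
#include <stdio.h>
#include <string.h>

/* Read one line from <dir>/<name>, stripping the trailing newline */
static int read_attr(const char *dir, const char *name, char *buf, size_t len)
{
    char path[512];
    FILE *f;
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);

    if (len == 0 || n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Write a value to <dir>/<name>; errors may only show up at close time */
static int write_attr(const char *dir, const char *name, const char *value)
{
    char path[512];
    FILE *f;
    int ok;
    int n = snprintf(path, sizeof(path), "%s/%s", dir, name);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Check the model URI, select a version, then create the device */
static int instantiate(const char *type_dir, const char *want_model,
                       const char *version, const char *uuid)
{
    char model[256];

    if (read_attr(type_dir, "model", model, sizeof(model)) != 0)
        return -1;
    if (strcmp(model, want_model) != 0)
        return -1; /* different device model, not compatible */
    if (write_attr(type_dir, "version", version) != 0)
        return -1; /* version not supported by this implementation */
    return write_attr(type_dir, "create", uuid);
}
```

One open question with this shape is that "version" is set on the type
group before "create" is written, so two users configuring the same type
concurrently would race.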

Maybe a cleaner way to structure this is to include the version as part
of the supported type group. So "my-nic" becomes "my-nic-1", "my-nic-2",
etc. There would still be a "version" sysfs attr but it would be
read-only. Device configuration parameters would only be present if they
were actually available in that version. For example, "my-nic-1" would
not expose an "rss" sysfs attr because it was introduced in "my-nic-2".
I see pros and cons to both the approach I outlined above and this
alternative, maybe someone more familiar with mdev has a preference?
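
With the alternative naming, tooling would need to recover the version from
the type group name. A minimal sketch of that parsing, assuming the
"<name>-<version>" convention suggested above (the function name is
hypothetical):

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split a type name such as "my-nic-2" into the length of its base name
 * ("my-nic", 6 characters) and its version (2). Returns 0 on success and
 * -1 if the name has no numeric "-<version>" suffix.
 */
static int split_versioned_type(const char *type, size_t *base_len,
                                unsigned long *version)
{
    const char *dash = strrchr(type, '-');
    char *end;
    unsigned long v;

    /* Require a non-empty base name and a digit right after the dash */
    if (!dash || dash == type || dash[1] < '0' || dash[1] > '9')
        return -1;

    errno = 0;
    v = strtoul(dash + 1, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;

    *base_len = (size_t)(dash - type);
    *version = v;
    return 0;
}
```

A name like "my-nic" without a version suffix is rejected rather than being
treated as version 0.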

> There's also a desire to support the vfio migration interface on
> non-mdev vfio devices.  We don't know yet if those will be separate,
> device specific vfio bus drivers or be integrated into existing
> vfio-pci, but the host device is likely instantiated by binding to a
> driver, so again I don't really understand where you're proposing this
> negotiation occurs.  Will management tools be required to create a
> device on-demand to fulfill a migration request or can we manipulate an
> existing device into the desired configuration?  Some management layers
> embrace the idea of device pools rather than dynamic creation.  Thanks,

The concept of device instantiation is natural for mdev and vfio-user,
but not essential.

When dealing with physical devices (even PCI SR-IOV), we don't need to
instantiate them explicitly. Device instances can already exist. As long
as we know their device model URI and configuration parameters we can
ensure migration compatibility.

For example, imagine a physical PCI NIC accompanied by a non-mdev VFIO
migration driver. The device model URI and configuration parameter
information can be distributed alongside the VFIO migration driver. It
could be available via modinfo(8), as a separate metadata file, via a
vendor-specific tool, etc.

Management tools need to match the device model/configuration from the
source device against the destination device. If the destination is
capable of supporting the source's device model/configuration then
migration can proceed safely.
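
As a rough sketch of that matching step, assuming the management tool has
somehow obtained the source's full configuration and the list of
name/value pairs the destination is able to accept (all type and function
names here are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One device configuration parameter, e.g. { "rss", "on" } */
struct param {
    const char *name;
    const char *value;
};

/* What the source device is currently running with */
struct source_desc {
    const char *model_uri;
    const struct param *config;
    size_t nconfig;
};

/* What the destination is capable of being configured as */
struct dest_desc {
    const char *model_uri;
    const struct param *accepted;
    size_t naccepted;
};

static bool dest_accepts(const struct dest_desc *dst, const struct param *p)
{
    for (size_t i = 0; i < dst->naccepted; i++) {
        if (strcmp(dst->accepted[i].name, p->name) == 0 &&
            strcmp(dst->accepted[i].value, p->value) == 0)
            return true;
    }
    return false;
}

/* Same model URI, and every source parameter accepted by the destination */
static bool migration_compatible(const struct source_desc *src,
                                 const struct dest_desc *dst)
{
    if (strcmp(src->model_uri, dst->model_uri) != 0)
        return false;
    for (size_t i = 0; i < src->nconfig; i++) {
        if (!dest_accepts(dst, &src->config[i]))
            return false;
    }
    return true;
}
```

Device model URIs are compared as opaque strings here, so
https://vendor-a.com/my-nic and https://vendor-b.com/my-nic never match.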

Let's look at the case where we are migrating from an older version of a
device to a newer version. On the source we have:

  model = https://vendor-a.com/my-nic

On the destination we have:

  model = https://vendor-a.com/my-nic
  rss = on

The two devices are incompatible because the destination exposes the RSS
feature that is not present on the source. The RSS feature involves
guest-visible hardware interface changes and a change to the device
state representation. It is not safe to migrate!

In this case an extra configuration step is necessary so that the
destination device can accept the device state from the source. The
management tool invokes a vendor-specific tool to put the device into
the right configuration:

  # vendor-tool set-migration-config --device 0000:00:04.0 \
                                     --model https://vendor-a.com/my-nic

(This tool only succeeds when the device is bound to VFIO but not yet
opened.)

The tool invokes ioctls on the vendor-specific VFIO driver that do two
things:
1. Tells the device to present the old hardware interface without RSS
2. Uses the old device state representation without RSS support

Does this approach fit?

> [1]https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg04519.html
> [2]https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00293.html
> [3]https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg02983.html