[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Question] SR-IOV VF 'surprise removal' and vfio_reset behavior in p

From: Alex Williamson
Subject: Re: [Question] SR-IOV VF 'surprise removal' and vfio_reset behavior in pSeries
Date: Mon, 4 Jan 2021 10:55:43 -0700

On Mon, 4 Jan 2021 10:35:45 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> Hi,
> This question came up while I was investigating a Libvirt bug [1], where an 
> user is removing
> VFs from the host while Libvirt domains was using them, causing Libvirt to 
> remain in
> an inconsistent state. I'm trying to alleviate the effects of this in Libvirt 
> (see [2] if curious),
> but QEMU is throwing some messages in the terminal that, although it appears 
> to be benign,
> I'm not sure if it's a symptom of something that should be fixed.
> In a Power 9 server running a Mellanox MT28800 SR-IOV netcard I have the 
> following IOMMU
> settings, where the physical card is at Group 0 and all the VFs are allocated 
> from Group 12 and
> onwards:
> IOMMU Group 0 0000:01:00.0 Infiniband controller [0207]: Mellanox 
> Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
> (...)
> IOMMU Group 12 0000:01:00.2 Infiniband controller [0207]: Mellanox 
> Technologies MT27800 Family [ConnectX-5 Virtual Function] [15b3:1018]
> IOMMU Group 13 0000:01:00.3 Infiniband controller [0207]: Mellanox 
> Technologies MT27800 Family [ConnectX-5 Virtual Function] [15b3:1018]
> (...)
> Creating a guest with the Group 12 VF and trying to remove the VF from the 
> host via
> echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs

FWIW, "surprise" removal is when the device is essentially already
removed, for example a hotplug NVMe drive that is simply yanked out of
the slot.  What you're describing in the above is a coordinated
removal where the PF driver will issue removal callbacks for each VF
device to ask the driver to unbind.

> Makes the guest remove the VF card, but throwing a warning/error message in 
> QEMU log:
> "qemu-system-ppc64: vfio: Cannot reset device 0000:01:00.2, depends on group 
> 0 which is not owned."

This looks like QEMU has hit the but reset code, where a bus reset of a
VF would necessarily require owning the PF, but VFs should always
support FLR.  I note that QEMU will try a bus reset if the
VFIO_DEVICE_RESET ioctl fails, which is where we'd try the FLR.  I
wonder if we're taking this path simply because locking issues in the
kernel prevent us from performing an FLR in this code path (ie. when
userspace is releasing the device from the driver remove callback).

> I found this message confusing because the VF was occupying IOMMU group 12, 
> but the message is
> claiming that the reset wasn't possible because Group 0 wasn't owned by the 
> process.

A bus reset would affect all the devices in the slot and therefore
requires ownership of all devices in the slot, but QEMU doesn't own the

> Digging it a bit, the hotunplug is fired up via the poweroff state of the 
> card triggering pSeries internals,
> and then reaching spapr_pci_unplug() in hw/ppc/spapr_pci.c. The body of the 
> function reads:
> -------
>      /* some version guests do not wait for completion of a device
>       * cleanup (generally done asynchronously by the kernel) before
>       * signaling to QEMU that the device is safe, but instead sleep
>       * for some 'safe' period of time. unfortunately on a busy host
>       * this sleep isn't guaranteed to be long enough, resulting in
>       * bad things like IRQ lines being left asserted during final
>       * device removal. to deal with this we call reset just prior
>       * to finalizing the device, which will put the device back into
>       * an 'idle' state, as the device cleanup code expects.
>       */
>      pci_device_reset(PCI_DEVICE(plugged_dev));
> -------
> My first question is right at this point: do we need PCI reset for a VF 
> removal?  I am not sure about
> handling IRQ lines asserted for a device that the kernel is going to remove.

I can't speak to this specific code, but it's not uncommon to reset a
device in order to quiesce it.  For a VF, it should be sufficient to
disable bus master to quiesce any interrupts since it would only
support MSI/X based interrupts.
> Going on further to the origin on the warning message we get to 
> hw/vfio/pci.c, vfio_pci_hot_reset().
> The VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl() is returning all VFs of the 
> card, including
> the physical function, in the vfio_pci_hot_reset_info struct. Then, down 
> where it verifies if all
> IOMMU groups required for reset belongs to the process, it fails to reset the 
> VF because QEMU
> does not have Group 0 ownership:
> -------
>      ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
>      if (ret) {
>          ret = -errno;
>          error_report("vfio: hot reset info failed: %m");
>          goto out_single;
>      }
> (...)
>          QLIST_FOREACH(group, &vfio_group_list, next) {
>              if (group->groupid == devices[i].group_id) {
>                  break;
>              }
>          }
>          if (!group) {
>              if (!vdev->has_pm_reset) {
>                  error_report("vfio: Cannot reset device %s, "
>                               "depends on group %d which is not owned.",
>                               vdev->vbasedev.name, devices[i].group_id);
>              }
>              ret = -EPERM;
>              goto out;
>          }
> -------
> This message is not clear to me because I'm aware that the VF was in Group 
> 12, but apparently the
> code is demanding ownership of all IOMMU Groups related to the card to allow 
> the reset.

Yup, we're in a failure path, this is never going to work for a VF, see:

static void vfio_pci_reset(DeviceState *dev)
    if (vdev->vbasedev.reset_works &&
        (vdev->has_flr || !vdev->has_pm_reset) &&
        !ioctl(vdev->vbasedev.fd, VFIO_DEVICE_RESET)) {

// This would work in the normal case, but in the kernel the try-locks are
// probably failing due to the device lock held by the remove callback path
// causing us to call into the bus reset path below.

        goto post_reset;

    /* See if we can do our own bus reset */
    if (!vfio_pci_hot_reset_one(vdev)) {
        goto post_reset;

AFAICT, this is undesirable, but harmless.  Ideally we could rework
some of the device lock code in the kernel to allow this to work, but
for now we just use try-locks to avoid deadlock and report failure if
we can't acquire the locks.  Potentially QEMU could look at the -EAGAIN
return value of the ioctl() and assume that it should skip the bus
reset attempt, which might just have a different warning, but
essentially leaving the device in the same state as it is now.

> The second question: is this intended?  If not, then someone is behaving 
> badly (perhaps the card driver,
> mlx5_core) and reporting wrong info to that VFIO ioctl(). If this reset 
> behavior is intended, then I
> might insert a code in spapr_pci_unplug() to skip resetting the VF in this 
> particular case to avoid the
> error message (assuming that we really can live without a reset in this case).

It's not intended, it's a side-effect of the device lock in the kernel
being difficult to workaround in this code path resulting in the
VFIO_DEVICE_RESET ioctl failing.  It's not unique to VFs either.  The
only thing I can suggest, barring tackling the device lock issues in
the kernel, would be to set a flag on the VFIOPCIDevice in
vfio_req_notifier_handler() that suppresses warnings in the reset path.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]