From: Daniel Henrique Barboza
Subject: [Question] SR-IOV VF 'surprise removal' and vfio_reset behavior in pSeries
Date: Mon, 4 Jan 2021 10:35:45 -0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0


This question came up while I was investigating a Libvirt bug [1], where a
user removes VFs from the host while Libvirt domains are using them, causing
Libvirt to remain in an inconsistent state. I'm trying to alleviate the
effects of this in Libvirt (see [2] if curious), but QEMU prints some messages
in the terminal that appear to be benign, and I'm not sure whether they are a
symptom of something that should be fixed.

On a Power 9 server running a Mellanox MT28800 SR-IOV network card I have the
following IOMMU settings, where the physical card is in Group 0 and the VFs
are allocated from Group 12 onward:

IOMMU Group 0 0000:01:00.0 Infiniband controller [0207]: Mellanox Technologies MT28800 Family [ConnectX-5 Ex] [15b3:1019]
IOMMU Group 12 0000:01:00.2 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] [15b3:1018]
IOMMU Group 13 0000:01:00.3 Infiniband controller [0207]: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] [15b3:1018]

Creating a guest with the Group 12 VF and then removing the VF from the host with

echo 0 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs

makes the guest remove the VF card, but QEMU throws a warning/error message
in its log:

"qemu-system-ppc64: vfio: Cannot reset device 0000:01:00.2, depends on group 0 which is not owned."

I found this message confusing because the VF was occupying IOMMU group 12,
but the message claims that the reset wasn't possible because Group 0 wasn't
owned by the QEMU process.

Digging into it a bit, the hot unplug is triggered by the power-off state of
the card, which goes through the pSeries internals and then reaches
spapr_pci_unplug() in hw/ppc/spapr_pci.c. The body of the function reads:

    /* some version guests do not wait for completion of a device
     * cleanup (generally done asynchronously by the kernel) before
     * signaling to QEMU that the device is safe, but instead sleep
     * for some 'safe' period of time. unfortunately on a busy host
     * this sleep isn't guaranteed to be long enough, resulting in
     * bad things like IRQ lines being left asserted during final
     * device removal. to deal with this we call reset just prior
     * to finalizing the device, which will put the device back into
     * an 'idle' state, as the device cleanup code expects.
     */

My first question is right at this point: do we need a PCI reset for a VF
removal? I am not sure about the need to handle asserted IRQ lines for a
device that the kernel is going to remove.

Going further, to the origin of the warning message, we get to hw/vfio/pci.c.
The VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl() returns all VFs of the card,
including the physical function, in the vfio_pci_hot_reset_info struct. Then,
further down, where the code verifies that all IOMMU groups required for the
reset belong to the process, it fails to reset the VF because QEMU does not
have Group 0 ownership:

    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info);
    if (ret) {
        ret = -errno;
        error_report("vfio: hot reset info failed: %m");
        goto out_single;
    }

    (...)

        /* look for the group of this dependent device among owned groups */
        QLIST_FOREACH(group, &vfio_group_list, next) {
            if (group->groupid == devices[i].group_id) {
                break;
            }
        }

        if (!group) {
            if (!vdev->has_pm_reset) {
                error_report("vfio: Cannot reset device %s, "
                             "depends on group %d which is not owned.",
                             vdev->vbasedev.name, devices[i].group_id);
            }
            ret = -EPERM;
            goto out;
        }

This message was not clear to me because I know that the VF is in Group 12,
but apparently the code demands ownership of all IOMMU groups related to the
card to allow the reset.
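
To make the check above easier to reason about, here is a minimal standalone
sketch of the same rule: a reset is refused as soon as one device in the
reported dependency list belongs to a group the process does not own. This is
only a model, not the QEMU code. The struct dep_device type and the functions
group_is_owned() and first_unowned_group() are hypothetical names introduced
for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the dependency list that the
 * hot reset info ioctl() reports. */
struct dep_device {
    const char *name;   /* PCI address, e.g. "0000:01:00.0" */
    int group_id;       /* IOMMU group that device belongs to */
};

/* True if group_id appears in the list of groups owned by the process. */
static bool group_is_owned(const int *owned, size_t n_owned, int group_id)
{
    for (size_t i = 0; i < n_owned; i++) {
        if (owned[i] == group_id) {
            return true;
        }
    }
    return false;
}

/* Returns -1 when every dependent device sits in an owned group (reset
 * allowed), otherwise the id of the first group that is not owned. */
static int first_unowned_group(const struct dep_device *devs, size_t n_devs,
                               const int *owned, size_t n_owned)
{
    assert(devs != NULL || n_devs == 0);

    for (size_t i = 0; i < n_devs; i++) {
        if (!group_is_owned(owned, n_owned, devs[i].group_id)) {
            return devs[i].group_id;
        }
    }
    return -1;
}
```

With the three devices listed earlier (groups 0, 12 and 13) as the dependency
list and only group 12 owned, first_unowned_group() returns 0, which matches
the group named in the QEMU message.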

The second question: is this intended? If not, then someone is misbehaving
(perhaps the card driver, mlx5_core) and reporting wrong info to that VFIO
ioctl(). If this reset behavior is intended, then I might add code to
spapr_pci_unplug() to skip resetting the VF in this particular case and avoid
the error message (assuming that we really can live without a reset in this
case).



[1] https://gitlab.com/libvirt/libvirt/-/issues/72
[2] https://www.redhat.com/archives/libvir-list/2021-January/msg00028.html
