[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH V6 21/27] vfio-pci: cpr part 3 (intx)
From: |
Steven Sistare |
Subject: |
Re: [PATCH V6 21/27] vfio-pci: cpr part 3 (intx) |
Date: |
Mon, 11 Apr 2022 12:23:50 -0400 |
User-agent: |
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 |
On 3/29/2022 7:03 AM, Fam Zheng wrote:
> On 2021-08-06 14:43, Steve Sistare wrote:
>> Preserve vfio INTX state across cpr restart. Preserve VFIOINTx fields as
>> follows:
>> pin : Recover this from the vfio config in kernel space
>> interrupt : Preserve its eventfd descriptor across exec.
>> unmask : Ditto
>> route.irq : This could perhaps be recovered in vfio_pci_post_load by
>> calling pci_device_route_intx_to_irq(pin), whose implementation reads
>> config space for a bridge device such as ich9. However, there is no
>> guarantee that the bridge vmstate is read before vfio vmstate. Rather
>> than fiddling with MigrationPriority for vmstate handlers, explicitly
>> save route.irq in vfio vmstate.
>> pending : save in vfio vmstate.
>> mmap_timeout, mmap_timer : Re-initialize
>> bool kvm_accel : Re-initialize
>>
>> In vfio_realize, defer calling vfio_intx_enable until the vmstate
>> is available, in vfio_pci_post_load. Modify vfio_intx_enable and
>> vfio_intx_kvm_enable to skip vfio initialization, but still perform
>> kvm initialization.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>
> Hi Steve,
>
> Not directly related to this patch, but since the context is close: it looks
> like this series only takes care of exec restart mode of vfio-pci, have you
> had
> any thoughts on kexec reboot mode with vfio-pci?
>
> The general idea is if DMAR context is not lost during kexec, we should be
> able
> to set up irqfds again and things will just work?
>
> Fam
Hi Fam,
I have thought about that use case, but only in general terms.
IMO it best fits in the cpr framework as a new mode (rather than as
a new -restore command line argument).
In your code below, you would have fewer code changes if you set
'reused = true' for the new mode, rather than testing both 'reused and
restored'
at multiple sites. Lastly, I cleaned up the vector handling somewhat from V6
to V7, so you may want to try your code using V7 as a base.
- Steve
> PS some more info below:
>
> I have some local kernel patches to kexec reboot most part of the host kernel
> while keeping IOMMU DMAR tables in a valid state; with that, not many extra
> things are needed in addition to restore it. A PoC is like below (I can share
> more details of the kernel changes if this patch makes any sense):
>
>
> commit f8951e58be86bd6e37f816394a9a73f28d8059fc
> Author: Fam Zheng <fam.zheng@bytedance.com>
> Date: Mon Mar 21 13:19:49 2022 +0000
>
> cpr: Add live-update support to vfio-pci devices
>
> In cpr-save, always serialize the vfio-pci states.
>
> In cpr-load, add a '-restore' mode that will do
> VFIO_GROUP_GET_DEVICE_FD_INTACT and skip DMAR setup, somewhat similar to
> the current cpr exec mode.
>
> Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
>
> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> index 73f4259556..e36f0ef97d 100644
> --- a/hw/pci/msix.c
> +++ b/hw/pci/msix.c
> @@ -584,10 +584,15 @@ void msix_init_vector_notifiers(PCIDevice *dev,
> MSIVectorReleaseNotifier release_notifier,
> MSIVectorPollNotifier poll_notifier)
> {
> + int vector;
> +
> assert(use_notifier && release_notifier);
> dev->msix_vector_use_notifier = use_notifier;
> dev->msix_vector_release_notifier = release_notifier;
> dev->msix_vector_poll_notifier = poll_notifier;
> + for (vector = 0; vector < dev->msix_entries_nr; ++vector) {
> + msix_handle_mask_update(dev, vector, true);
> + }
> }
>
> int msix_set_vector_notifiers(PCIDevice *dev,
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 605ffbb5d0..f1240410a8 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -2066,6 +2066,9 @@ static int vfio_connect_container(VFIOGroup *group,
> AddressSpace *as,
> bool reused;
> VFIOAddressSpace *space;
>
> + if (restore) {
> + return 0;
> + }
> space = vfio_get_address_space(as);
> fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> reused = (fd > 0);
> @@ -2486,7 +2489,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
> fd = cpr_find_fd(name, 0);
> reused = (fd >= 0);
> if (!reused) {
> - fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> + int op = restore ? VFIO_GROUP_GET_DEVICE_FD_INTACT :
> VFIO_GROUP_GET_DEVICE_FD;
> + fd = ioctl(group->fd, op, name);
> }
>
> if (fd < 0) {
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index e32513c668..9da5f93228 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -361,7 +361,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error
> **errp)
> * Do not alter interrupt state during vfio_realize and cpr-load. The
> * reused flag is cleared thereafter.
> */
> - if (!vdev->pdev.reused) {
> + if (!vdev->pdev.reused && !restore) {
> vfio_disable_interrupts(vdev);
> }
>
> @@ -388,7 +388,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error
> **errp)
> fd = event_notifier_get_fd(&vdev->intx.interrupt);
> qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
>
> - if (vdev->pdev.reused) {
> + if (vdev->pdev.reused && !restore) {
> vfio_intx_reenable_kvm(vdev, &err);
> goto finish;
> }
> @@ -2326,6 +2326,9 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool
> single)
> int ret, i, count;
> bool multi = false;
>
> + if (restore) {
> + return 0;
> + }
> trace_vfio_pci_hot_reset(vdev->vbasedev.name, single ? "one" : "multi");
>
> if (!single) {
> @@ -3185,7 +3188,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
>
> /* Wait until cpr-load reads intx routing data to enable */
> - if (!pdev->reused) {
> + if (!pdev->reused && !restore) {
> ret = vfio_intx_enable(vdev, errp);
> if (ret) {
> goto out_deregister;
> @@ -3295,7 +3298,7 @@ static void vfio_pci_reset(DeviceState *dev)
> VFIOPCIDevice *vdev = VFIO_PCI(dev);
>
> /* Do not reset the device during qemu_system_reset prior to cpr-load */
> - if (vdev->pdev.reused) {
> + if (vdev->pdev.reused || restore) {
> return;
> }
>
> @@ -3429,33 +3432,40 @@ static void vfio_merge_config(VFIOPCIDevice *vdev)
>
> static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool
> msix)
> {
> - int i, fd;
> + int i, fd, ret;
> bool pending = false;
> PCIDevice *pdev = &vdev->pdev;
>
> + pdev->msix_entries_nr = nr_vectors;
> vdev->nr_vectors = nr_vectors;
> vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
> vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
>
> - for (i = 0; i < nr_vectors; i++) {
> - VFIOMSIVector *vector = &vdev->msi_vectors[i];
> -
> - fd = load_event_fd(vdev, "interrupt", i);
> - if (fd >= 0) {
> - vfio_vector_init(vdev, i);
> - qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> + if (restore) {
> + ret = vfio_enable_vectors(vdev, true);
> + if (ret) {
> + error_report("vfio: failed to enable vectors, %d", ret);
> }
> + } else {
> + for (i = 0; i < nr_vectors; i++) {
> + VFIOMSIVector *vector = &vdev->msi_vectors[i];
>
> - if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
> - vfio_add_kvm_msi_virq(vdev, vector, i, msix);
> - }
> + fd = load_event_fd(vdev, "interrupt", i);
> + if (fd >= 0) {
> + vfio_vector_init(vdev, i);
> + qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> + }
>
> - if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
> - set_bit(i, vdev->msix->pending);
> - pending = true;
> + if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
> + vfio_add_kvm_msi_virq(vdev, vector, i, msix);
> + }
> +
> + if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i))
> {
> + set_bit(i, vdev->msix->pending);
> + pending = true;
> + }
> }
> }
> -
> if (msix) {
> memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
> }
> @@ -3534,7 +3544,7 @@ static const VMStateDescription vfio_intx_vmstate = {
>
> static bool vfio_pci_needed(void *opaque)
> {
> - return cpr_get_mode() == CPR_MODE_RESTART;
> + return 1;
> }
>
> static const VMStateDescription vfio_pci_vmstate = {
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 6241c20fb1..0179b0aa90 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -26,6 +26,7 @@ void configure_rtc(QemuOpts *opts);
> void qemu_init_subsystems(void);
>
> extern int autostart;
> +extern int restore;
>
> typedef enum {
> VGA_NONE, VGA_STD, VGA_CIRRUS, VGA_VMWARE, VGA_XENFB, VGA_QXL,
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index e680594f27..65c3bab074 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -188,6 +188,8 @@ struct vfio_group_status {
> */
> #define VFIO_GROUP_GET_DEVICE_FD _IO(VFIO_TYPE, VFIO_BASE + 6)
>
> +#define VFIO_GROUP_GET_DEVICE_FD_INTACT _IO(VFIO_TYPE, VFIO_BASE + 21)
> +
> /* --------------- IOCTLs for DEVICE file descriptors --------------- */
>
> /**
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 8b90d04cb9..03666a59b3 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -3984,6 +3984,10 @@ SRST
> option is experimental.
> ERST
>
> +DEF("restore", 0, QEMU_OPTION_restore, \
> + "-restore restore mode",
> + QEMU_ARCH_ALL)
> +
> DEF("S", 0, QEMU_OPTION_S, \
> "-S freeze CPU at startup (use 'c' to start execution)\n",
> QEMU_ARCH_ALL)
> diff --git a/softmmu/globals.c b/softmmu/globals.c
> index a18fd8dcf3..6fcb5846b4 100644
> --- a/softmmu/globals.c
> +++ b/softmmu/globals.c
> @@ -41,6 +41,7 @@ bool enable_cpu_pm;
> int nb_nics;
> NICInfo nd_table[MAX_NICS];
> int autostart = 1;
> +int restore;
> int vga_interface_type = VGA_NONE;
> Chardev *parallel_hds[MAX_PARALLEL_PORTS];
> int win2k_install_hack;
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index f14e29e622..fba6b577cb 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -3088,6 +3088,9 @@ void qemu_init(int argc, char **argv, char **envp)
> case QEMU_OPTION_S:
> autostart = 0;
> break;
> + case QEMU_OPTION_restore:
> + restore = 1;
> + break;
> case QEMU_OPTION_k:
> keyboard_layout = optarg;
> break;
- Re: [PATCH V6 21/27] vfio-pci: cpr part 3 (intx),
Steven Sistare <=