From: Alex Williamson
Subject: Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
Date: Thu, 10 Mar 2022 11:35:41 -0700

On Thu, 10 Mar 2022 10:00:29 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 3/7/2022 5:16 PM, Alex Williamson wrote:
> > On Wed, 22 Dec 2021 11:05:24 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
> >>  {
> >>      int iommu_type, ret;
> >>  
> >> +    /*
> >> +     * If container is reused, just set its type and skip the ioctls, as the
> >> +     * container and group are already configured in the kernel.
> >> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> >> +     * If you ever add new types or spapr cpr support, kind reader, please
> >> +     * also implement VFIO_GET_IOMMU.
> >> +     */  
> > 
> > VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> > problem is that vfio_iommu_type1_check_extension() should actually base
> > some of the details on the instantiated vfio_iommu, ex.
> > 
> >     switch (arg) {
> >     case VFIO_TYPE1_IOMMU:
> >             return (iommu && iommu->v2) ? 0 : 1;
> >     case VFIO_UNMAP_ALL:
> >     case VFIO_UPDATE_VADDR:
> >     case VFIO_TYPE1v2_IOMMU:
> >             return (iommu && !iommu->v2) ? 0 : 1;
> >     case VFIO_TYPE1_NESTING_IOMMU:
> >             return (iommu && !iommu->nesting) ? 0 : 1;
> >     ...
> > 
> > We can't support v1 if we've already set a v2 container and vice versa.
> > There are probably some corner cases and compatibility to puzzle
> > through, but I wouldn't think we need a new ioctl to check this.  
> 
> That change makes sense, and may be worthwhile on its own merits, but it does
> not solve the problem, which is that qemu will not be able to infer iommu_type
> in the future if new types are added.  Given:
>   * a new kernel supporting shiny new TYPE1v3
>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has
>     no knowledge of v3
>   * live update to qemu which supports v3, which will be listed first in
>     vfio_get_iommu_type.
> 
> Then the new qemu has no way to infer iommu_type.  If it has code that makes
> decisions based on iommu_type (e.g., VFIO_SPAPR_TCE_v2_IOMMU in
> vfio_container_region_add, or vfio_ram_block_discard_disable, or ...), then
> new qemu cannot function correctly.
> 
> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added at the
> same time our hypothetical future developer adds TYPE1v3.  The current
> inability to ask the kernel "what are you" about a container feels like a bug
> to me.

Hmm, I don't think the kernel has an innate responsibility to remind
the user of a configuration that they've already made.  But I also
don't follow your TYPE1v3 example.  If we added such a type, I imagine
the switch would change to:

        switch (arg) {
        case VFIO_TYPE1_IOMMU:
                return (iommu && (iommu->v2 || iommu->v3)) ? 0 : 1;
        case VFIO_UNMAP_ALL:
        case VFIO_UPDATE_VADDR:
                return (iommu && !(iommu->v2 || iommu->v3)) ? 0 : 1;
        case VFIO_TYPE1v2_IOMMU:
                return (iommu && !iommu->v2) ? 0 : 1;
        case VFIO_TYPE1v3_IOMMU:
                return (iommu && !iommu->v3) ? 0 : 1;
        ...

How would that not allow exactly the scenario described, i.e. new QEMU
can see that old QEMU left it a v2 IOMMU?
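
As a sketch, assuming the VFIO_CHECK_EXTENSION semantics above, new QEMU
could infer the type of an inherited container by probing each type it
knows about, in preference order, and taking the first one the kernel
still reports as available:

    /*
     * Sketch only: with the proposed semantics, the kernel reports 0
     * for any extension that conflicts with the instantiated iommu,
     * so probing in preference order identifies the reused type.
     */
    static int vfio_get_reused_iommu_type(VFIOContainer *container)
    {
        int types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU };
        int i;

        for (i = 0; i < ARRAY_SIZE(types); i++) {
            if (ioctl(container->fd, VFIO_CHECK_EXTENSION, types[i])) {
                return types[i];
            }
        }
        return -1;  /* no compatible type; fail the restart */
    }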

...
> >> +
> >> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> >> +{
> >> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> >> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> >> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> >> +                         "or VFIO_UNMAP_ALL");
> >> +        return false;
> >> +    } else {
> >> +        return true;
> >> +    }
> >> +}  
> > 
> > We could have minimally used this where we assumed a TYPE1v2 container.  
> 
> Are you referring to vfio_init_container (discussed above)?
> Are you suggesting that, if reused is true, we validate those extensions are
> present, before setting iommu_type = VFIO_TYPE1v2_IOMMU?

Yeah, though maybe it's not sufficiently precise to be worthwhile given
the current kernel behavior.
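
E.g., the reuse path in vfio_init_container could check before trusting
the inherited type; a minimal sketch, assuming the reused flag and the
vfio_is_cpr_capable() helper from this series:

    if (container->reused) {
        /* kernel already holds the configuration; verify it is usable */
        if (!vfio_is_cpr_capable(container, errp)) {
            return -1;
        }
        container->iommu_type = VFIO_TYPE1v2_IOMMU;
        return 0;
    }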

> >> +
> >> +/*
> >> + * Verify that all containers support CPR, and unmap all dma vaddr's.
> >> + */
> >> +int vfio_cpr_save(Error **errp)
> >> +{
> >> +    ERRP_GUARD();
> >> +    VFIOAddressSpace *space;
> >> +    VFIOContainer *container;
> >> +
> >> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> >> +        QLIST_FOREACH(container, &space->containers, next) {
> >> +            if (!vfio_is_cpr_capable(container, errp)) {
> >> +                return -1;
> >> +            }
> >> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
> >> +                return -1;
> >> +            }
> >> +        }
> >> +    }  
> > 
> > Seems like we ought to validate all containers support CPR before we
> > start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
> > this fails with no attempt to unwind!  Yikes!  Wouldn't we need to
> > replay the listeners to remap the vaddrs in case of an error?  
> 
> Already done.  I refactored that code into a separate patch to tease out some
> of the complexity:
>   vfio-pci: recover from unmap-all-vaddr failure

Sorry, didn't get to that one til after I'd sent comments here.
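
For reference, the two-pass structure I was getting at looks roughly
like this (a sketch, using the helpers from this patch):

    /* pass 1: fail before any vaddr has been invalidated */
    QLIST_FOREACH(space, &vfio_address_spaces, list) {
        QLIST_FOREACH(container, &space->containers, next) {
            if (!vfio_is_cpr_capable(container, errp)) {
                return -1;
            }
        }
    }
    /* pass 2: all containers are capable, now unmap */
    QLIST_FOREACH(space, &vfio_address_spaces, list) {
        QLIST_FOREACH(container, &space->containers, next) {
            if (vfio_dma_unmap_vaddr_all(container, errp)) {
                return -1;  /* caller must remap the already-unmapped vaddrs */
            }
        }
    }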

...
> >> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> >> index a4da24e..a4007cf 100644
> >> --- a/include/migration/cpr.h
> >> +++ b/include/migration/cpr.h
> >> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
> >>  int cpr_state_load(Error **errp);
> >>  void cpr_state_print(void);
> >>  
> >> +int cpr_vfio_save(Error **errp);
> >> +int cpr_vfio_load(Error **errp);
> >> +
> >>  #endif
> >> diff --git a/migration/cpr.c b/migration/cpr.c
> >> index 37eca66..cee82cf 100644
> >> --- a/migration/cpr.c
> >> +++ b/migration/cpr.c
> >> @@ -7,6 +7,7 @@
> >>  
> >>  #include "qemu/osdep.h"
> >>  #include "exec/memory.h"
> >> +#include "hw/vfio/vfio-common.h"
> >>  #include "io/channel-buffer.h"
> >>  #include "io/channel-file.h"
> >>  #include "migration.h"
> >> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
> >>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
> >>          return;
> >>      }
> >> -
> >> +    if (cpr_vfio_save(errp)) {
> >> +        return;
> >> +    }  
> > 
> > Why is vfio so unique that it needs separate handlers versus other
> > devices?  Thanks,  
> 
> In earlier patches these functions fiddled with more objects, but at this
> point they are simple enough to convert to pre_save and post_load vmstate
> handlers for the container and group objects.  However, we would still need
> to call special functions for vfio from qmp_cpr_exec:
> 
>   * validate all containers support CPR before we start blasting vaddrs.
>     However, I could validate all of them in every call to pre_save for each
>     container.  That would be less efficient, but fits the vmstate model.

Would it be a better option to mirror the migration blocker support, i.e.
any device that doesn't support cpr registers a blocker, and the generic
code only needs to keep track of whether any blockers are registered?
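
A minimal sketch of that idea; cpr_add_blocker() is hypothetical,
mirroring migrate_add_blocker():

    Error *blocker = NULL;

    /* at container setup time, not at cpr-save time */
    if (!vfio_is_cpr_capable(container, NULL)) {
        error_setg(&blocker,
                   "VFIO container does not support VFIO_UPDATE_VADDR "
                   "or VFIO_UNMAP_ALL");
        cpr_add_blocker(blocker, errp);  /* hypothetical API */
    }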

>   * restore all vaddr's if qemu_save_device_state fails.
>     However, I could recover for all containers inside pre_save when one
>     container fails.  Feels strange touching all objects in a function for
>     one, but there is no real downside.

I'm not as familiar as I should be with migration callbacks, thanks to
mostly not supporting it for vfio devices, but it seems strange to me
that there's no existing callback or notifier per device to propagate
save failure.  Do we not at least get some sort of resume callback in
that case?

As an alternative, maybe each container could register a vm change
handler that would trigger reloading vaddrs if we move to a running
state and a flag on the container indicates vaddrs were invalidated?
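Roughly (a sketch; the vaddr_unmapped flag and the remap helper are
assumed, the change state handler itself is the existing qemu API):

    static void vfio_cpr_vm_state_change(void *opaque, bool running,
                                         RunState state)
    {
        VFIOContainer *container = opaque;

        /* remap only if cpr-save invalidated the vaddrs and we resume */
        if (running && container->vaddr_unmapped) {
            vfio_remap_all_vaddr(container);   /* assumed helper */
            container->vaddr_unmapped = false;
        }
    }

    ...
    qemu_add_vm_change_state_handler(vfio_cpr_vm_state_change, container);
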
Thanks,

Alex



