qemu-devel

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode


From: Jason Wang
Subject: Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode
Date: Wed, 24 Nov 2021 16:28:52 +0800

On Wed, Nov 24, 2021 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > When booting with scalable mode, I hit this error:
> >
> > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero iova=0xfffff002, level=0x1slpte=0x102681803)
> > qemu-system-x86_64: vtd_iommu_translate: detected translation failure (dev=01:00:00, iova=0xfffff002)
> > qemu-system-x86_64: New fault is not recorded due to compression of faults
> >
> > This is because the SNP bit has been set since Linux kernel commit
> > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > FL for IOVA"), which sets the SNP bit when scalable mode is on. This
> > seems to be a violation of the spec, which says the SNP bit is
> > treated as reserved if SC is not supported.
>
> When I was reading that commit, I was actually confused by this change:
>
> ---8<---
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 956a02eb40b4..0ee5f1bd8af2 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct intel_iommu *skip)
>         rcu_read_lock();
>         for_each_active_iommu(iommu, drhd) {
>                 if (iommu != skip) {
> -                       if (!ecap_sc_support(iommu->ecap)) {
> +                       /*
> +                        * If the hardware is operating in the scalable mode,
> +                        * the snooping control is always supported since we
> +                        * always set PASID-table-entry.PGSNP bit if the domain
> +                        * is managed outside (UNMANAGED).
> +                        */
> +                       if (!sm_supported(iommu) &&
> +                           !ecap_sc_support(iommu->ecap)) {
>                                 ret = 0;
>                                 break;
>                         }
> ---8<---
>
> Does it mean that some hardware with sm_supported()==true will have the
> SC bit cleared in the ecap register?

I guess not, so it's probably a problem only for the vIOMMU.

> That sounds odd, and I'm not sure why.  Maybe Yi
> Liu or Yi Sun may know?

Another interesting point: it looks to me that, after that commit,
SNP is assumed for domains that are not UNMANAGED, even though PGSNP
is not set for them.


>
> >
> > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > the future we may consider adding SC support.
>
> Oh yes, I remember the last time we discussed this.  Could you remind me
> what's missing for us to support SC?

Exactly what you described below.

>
> IIUC, for common device emulations we can just declare SC==1, right?

Strictly speaking, it's only safe for datapaths that run inside
QEMU. For things like vhost-user, I'm not sure the backend can check
CC when using VFIO.

>  As all
> the DMAs (including kernel accelerators like vhost) will come from host
> processors, there are no coherency issues with guest vCPU threads.
>
> If that's correct, the only challenge is device assignment in any form (I am
> not familiar with vdpa; so perhaps that includes vfio, vdpa and any other kind
> of assigning host devices to the guest?), then we'll try to detect IOMMU_CACHE
> capability from the host iommu groups that cover the assigned devices, and we
> only set SC==1 if we have cache coherency on all the devices?

For VFIO yes, and we should prevent a VFIO device without CC from being
plugged if SC is advertised.

For vDPA, we don't need to worry about it at all, since kernel vDPA
forces IOMMU_CACHE now:

vhost_vdpa_alloc_domain():

        if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
                return -ENOTSUPP;

(For devices with an on-chip IOMMU, it's the parent device that
guarantees the CC.)

Thanks


>
> >
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
> >  hw/i386/intel_iommu.c          | 18 ++++++++++++------
> >  hw/i386/intel_iommu_internal.h |  2 ++
> >  2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 294499ee20..3bcac56c3e 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -969,7 +969,8 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> >  static uint64_t vtd_spte_rsvd[5];
> >  static uint64_t vtd_spte_rsvd_large[5];
> >
> > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > +                                   uint64_t slpte, uint32_t level)
> >  {
> >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> >
> > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> >          rsvd_mask = vtd_spte_rsvd_large[level];
> >      }
> >
> > +    if (s->scalable_mode) {
> > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > +    }
>
> IMHO what we want is only to skip the SNP check on the leaves of the
> pgtables, so maybe we still want to keep checking bit 11 as reserved
> for e.g. common pgtable dir entries?
>
> To do so, how about directly modifying the vtd_spte_rsvd* fields in
> vtd_init()?  I think we only need to modify the 4K/2M/1G entries to
> mask out bit 11; they're:
>
>   - vtd_spte_rsvd[1] (4K)
>   - vtd_spte_rsvd_large[2] (2M)
>   - vtd_spte_rsvd_large[3] (1G)
>
> What do you think?  Then we avoid passing IntelIOMMUState* all over too.
>
> > +
> >      return slpte & rsvd_mask;
> >  }
> >
> > @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
> >                                iova, level, slpte, is_write);
> >              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> >          }
> > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> >              error_report_once("%s: detected splte reserve non-zero "
> >                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
> >                                "slpte=0x%" PRIx64 ")", __func__, iova,
> > @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent *event, vtd_page_walk_info *info)
> >   * @write: whether parent level has write permission
> >   * @info: constant information for the page walk
> >   */
> > -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > +static int vtd_page_walk_level(IntelIOMMUState *s,
> > +                               dma_addr_t addr, uint64_t start,
> >                                 uint64_t end, uint32_t level, bool read,
> >                                 bool write, vtd_page_walk_info *info)
> >  {
> > @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> >              goto next;
> >          }
> >
> > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> >              trace_vtd_page_walk_skip_reserve(iova, iova_next);
> >              goto next;
> >          }
> > @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> >               * This is a valid PDE (or even bigger than PDE).  We need
> >               * to walk one further level.
> >               */
> > -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, info->aw),
> > +            ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, info->aw),
> >                                        iova, MIN(iova_next, end), level - 1,
> >                                        read_cur, write_cur, info);
> >          } else {
> > @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
> >          end = vtd_iova_limit(s, ce, info->aw);
> >      }
> >
> > -    return vtd_page_walk_level(addr, start, end, level, true, true, info);
> > +    return vtd_page_walk_level(s, addr, start, end, level, true, true, info);
> >  }
> >
> >  static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> > diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> > index 3d5487fe2c..a6c788049b 100644
> > --- a/hw/i386/intel_iommu_internal.h
> > +++ b/hw/i386/intel_iommu_internal.h
> > @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
> >  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
> >
> >  /* Rsvd field masks for spte */
> > +#define VTD_SPTE_SNP 0x800ULL
> > +
> >  #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
> >          dt_supported ? \
> >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
> > --
> > 2.25.1
> >
>
> --
> Peter Xu
>



