qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode


From: Jason Wang
Subject: Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode
Date: Wed, 24 Nov 2021 17:01:42 +0800

On Wed, Nov 24, 2021 at 4:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 24, 2021 at 04:28:52PM +0800, Jason Wang wrote:
> > On Wed, Nov 24, 2021 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > When booting with scalable mode, I hit this error:
> > > >
> > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero 
> > > > iova=0xfffff002, level=0x1slpte=0x102681803)
> > > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> > > > (dev=01:00:00, iova=0xfffff002)
> > > > qemu-system-x86_64: New fault is not recorded due to compression of 
> > > > faults
> > > >
> > > > This is because the SNP bit is set since Linux kernel commit
> > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > seems to be an violation on the spec which said the SNP bit is
> > > > considered to be reserved if SC is not supported.
> > >
> > > When I was reading that commit, I was actually confused by this change:
> > >
> > > ---8<---
> > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > --- a/drivers/iommu/intel/iommu.c
> > > +++ b/drivers/iommu/intel/iommu.c
> > > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct 
> > > intel_iommu *skip)
> > >         rcu_read_lock();
> > >         for_each_active_iommu(iommu, drhd) {
> > >                 if (iommu != skip) {
> > > -                       if (!ecap_sc_support(iommu->ecap)) {
> > > +                       /*
> > > +                        * If the hardware is operating in the scalable 
> > > mode,
> > > +                        * the snooping control is always supported since 
> > > we
> > > +                        * always set PASID-table-entry.PGSNP bit if the 
> > > domain
> > > +                        * is managed outside (UNMANAGED).
> > > +                        */
> > > +                       if (!sm_supported(iommu) &&
> > > +                           !ecap_sc_support(iommu->ecap)) {
> > >                                 ret = 0;
> > >                                 break;
> > >                         }
> > > ---8<---
> > >
> > > Does it mean that for some hardwares that has sm_supported()==true, it'll 
> > > have
> > > SC bit cleared in ecap register?
> >
> > I guess not, so it's probably only the problem of vIOMMU.
>
> But then what does the code mean above?
>
> If SC is required for scalable mode,

I guess the code was tested on hardware with both SC and SM.

> ecap_sc_support()==false already implies
> sm_supported()==false too.  Then that check seems redundant.

To tell the truth, I'm not sure. And according to the spec SNP is
reserved if SC==false.

>
> >
> > > That sounds odd, and not sure why.  Maybe Yi
> > > Liu or Yi Sun may know?
> >
> > Another interesting point is that, it looks to me after that commit
> > SNP is used for the domain that is not UNMANAGED even if PGSNP is not
> > set.
> >
> >
> > >
> > > >
> > > > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > > > the future we may consider to add SC support.
> > >
> > > Oh yes, I remembered the last time we discussed this.  Could you remind me
> > > what's missing for us to support SC?
> >
> > Exactly what you described below.
> >
> > >
> > > IIUC, for common device emulations we can just declare SC==1, right?
> >
> > Strictly speaking, only safe for the datapath that is running in the
> > Qemu. For things like vhost-user, I'm not sure it can check CC when
> > using VFIO.
>
> Hmm yeah.. though I'll just call those vhost-user backends to fall into below
> "device assignment" category too.  Great to know we're on the same page here.

I see.

>
> >
> > >  As all
> > > the DMAs (including kernel accels like vhost) will be from host 
> > > processors so
> > > there're no coherent issues with guest vcpu threads.
> > >
> > > If that's correct, the only challenge is device assignment in any form (I 
> > > am
> > > not familiar with vdpa; so perhaps that includes vfio, vpda and any other 
> > > kind
> > > of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> > > capability from the host iommu groups that covers the assigned devices, 
> > > and we
> > > only set SC==1 if we have cache coherency on all the devices?
> >
> > For VFIO yes, and we should prevent VFIO without CC to be plugged if
> > SC is advertised.
> >
> > For vDPA, we don't need to worry about it at all, kernel vDPA forces
> > IOMMU_CACHE now.
> >
> > vhost_vdpa_alloc_domain():
> >
> >         if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> >                 return -ENOTSUPP;
> >
> > (For device with on-chip IOMMU, it's the parent and device that
> > guarantees the CC)
>
> Ah right, yes you mentioned it and I forgot..  Though I'm not sure we'd simply
> double-check again here (if we'll support vfio anyway, then we'll need to be
> able to read those out from IOMMU groups), because we shouldn't rely on that
> fact which is an implementation detail of vdpa, imho (say, when vdpa starts to
> support !SC someday).

Good point, but at that time we need to expose the !SC via uAPI which
is currently not supported.

>
> PS: I have other comments below in previous reply - please have a look too! 
> :-D

Sorry for missing that part :(

>
> >
> > Thanks
> >
> >
> > >
> > > >
> > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > ---
> > > >  hw/i386/intel_iommu.c          | 18 ++++++++++++------
> > > >  hw/i386/intel_iommu_internal.h |  2 ++
> > > >  2 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index 294499ee20..3bcac56c3e 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -969,7 +969,8 @@ static dma_addr_t 
> > > > vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> > > >  static uint64_t vtd_spte_rsvd[5];
> > > >  static uint64_t vtd_spte_rsvd_large[5];
> > > >
> > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > +                                   uint64_t slpte, uint32_t level)
> > > >  {
> > > >      uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > >
> > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, 
> > > > uint32_t level)
> > > >          rsvd_mask = vtd_spte_rsvd_large[level];
> > > >      }
> > > >
> > > > +    if (s->scalable_mode) {
> > > > +        rsvd_mask &= ~VTD_SPTE_SNP;
> > > > +    }
> > >
> > > IMHO what we want to do is only to skip the leaves of pgtables on SNP, so 
> > > maybe
> > > we still want to keep checking the bit 11 reserved for e.g. common 
> > > pgtable dir
> > > entries?

Maybe, but it's probably a question that can only be answered by
Intel. I can change it for the next version if you stick.

> > >
> > > To do so, how about directly modifying the vtd_spte_rsvd* fields in 
> > > vtd_init()?
> > > I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
> > >
> > >   - vtd_spte_rsvd[1] (4K)
> > >   - vtd_spte_rsvd_large[2] (2M)
> > >   - vtd_spte_rsvd_large[3] (1G)
> > >
> > > What do you think?  Then we avoid passing IntelIOMMUState* all over too.

I started a version like that:), it should work, I will change that if
it was agreed by everyone.

The reason that I changed to pass IntelIOMMUState is that it results
in a smaller changeset. The reason is that I tend to introduce new
rsvd bits for SM mode since after checking vtd 3.3 it looks have
different reserved bit requirement than before (at least 1.2)

Thanks

>
> [Here]
>
> > >
> > > > +
> > > >      return slpte & rsvd_mask;
> > > >  }
> > > >
> > > > @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, 
> > > > VTDContextEntry *ce,
> > > >                                iova, level, slpte, is_write);
> > > >              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> > > >          }
> > > > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > > >              error_report_once("%s: detected splte reserve non-zero "
> > > >                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
> > > >                                "slpte=0x%" PRIx64 ")", __func__, iova,
> > > > @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent 
> > > > *event, vtd_page_walk_info *info)
> > > >   * @write: whether parent level has write permission
> > > >   * @info: constant information for the page walk
> > > >   */
> > > > -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > > > +static int vtd_page_walk_level(IntelIOMMUState *s,
> > > > +                               dma_addr_t addr, uint64_t start,
> > > >                                 uint64_t end, uint32_t level, bool read,
> > > >                                 bool write, vtd_page_walk_info *info)
> > > >  {
> > > > @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr, 
> > > > uint64_t start,
> > > >              goto next;
> > > >          }
> > > >
> > > > -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > > +        if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > > >              trace_vtd_page_walk_skip_reserve(iova, iova_next);
> > > >              goto next;
> > > >          }
> > > > @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr, 
> > > > uint64_t start,
> > > >               * This is a valid PDE (or even bigger than PDE).  We need
> > > >               * to walk one further level.
> > > >               */
> > > > -            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, 
> > > > info->aw),
> > > > +            ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte, 
> > > > info->aw),
> > > >                                        iova, MIN(iova_next, end), level 
> > > > - 1,
> > > >                                        read_cur, write_cur, info);
> > > >          } else {
> > > > @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s, 
> > > > VTDContextEntry *ce,
> > > >          end = vtd_iova_limit(s, ce, info->aw);
> > > >      }
> > > >
> > > > -    return vtd_page_walk_level(addr, start, end, level, true, true, 
> > > > info);
> > > > +    return vtd_page_walk_level(s, addr, start, end, level, true, true, 
> > > > info);
> > > >  }
> > > >
> > > >  static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> > > > diff --git a/hw/i386/intel_iommu_internal.h 
> > > > b/hw/i386/intel_iommu_internal.h
> > > > index 3d5487fe2c..a6c788049b 100644
> > > > --- a/hw/i386/intel_iommu_internal.h
> > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > >  #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
> > > >
> > > >  /* Rsvd field masks for spte */
> > > > +#define VTD_SPTE_SNP 0x800ULL
> > > > +
> > > >  #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
> > > >          dt_supported ? \
> > > >          (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) 
> > > > : \
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > --
> > > Peter Xu
> > >
> >
>
> --
> Peter Xu
>




reply via email to

[Prev in Thread] Current Thread [Next in Thread]