[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode
From: |
Jason Wang |
Subject: |
Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode |
Date: |
Wed, 24 Nov 2021 17:01:42 +0800 |
On Wed, Nov 24, 2021 at 4:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Nov 24, 2021 at 04:28:52PM +0800, Jason Wang wrote:
> > On Wed, Nov 24, 2021 at 3:57 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > When booting with scalable mode, I hit this error:
> > > >
> > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero
> > > > iova=0xfffff002, level=0x1slpte=0x102681803)
> > > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure
> > > > (dev=01:00:00, iova=0xfffff002)
> > > > qemu-system-x86_64: New fault is not recorded due to compression of
> > > > faults
> > > >
> > > > This is because the SNP bit is set since Linux kernel commit
> > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > seems to be an violation on the spec which said the SNP bit is
> > > > considered to be reserved if SC is not supported.
> > >
> > > When I was reading that commit, I was actually confused by this change:
> > >
> > > ---8<---
> > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > --- a/drivers/iommu/intel/iommu.c
> > > +++ b/drivers/iommu/intel/iommu.c
> > > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct
> > > intel_iommu *skip)
> > > rcu_read_lock();
> > > for_each_active_iommu(iommu, drhd) {
> > > if (iommu != skip) {
> > > - if (!ecap_sc_support(iommu->ecap)) {
> > > + /*
> > > + * If the hardware is operating in the scalable
> > > mode,
> > > + * the snooping control is always supported since
> > > we
> > > + * always set PASID-table-entry.PGSNP bit if the
> > > domain
> > > + * is managed outside (UNMANAGED).
> > > + */
> > > + if (!sm_supported(iommu) &&
> > > + !ecap_sc_support(iommu->ecap)) {
> > > ret = 0;
> > > break;
> > > }
> > > ---8<---
> > >
> > > Does it mean that for some hardwares that has sm_supported()==true, it'll
> > > have
> > > SC bit cleared in ecap register?
> >
> > I guess not, so it's probably only the problem of vIOMMU.
>
> But then what does the code mean above?
>
> If SC is required for scalable mode,
I guess the code was tested on hardware with both SC and SM.
> ecap_sc_support()==false already implies
> sm_supported()==false too. Then that check seems redundant.
To tell the truth, I'm not sure. And according to the spec SNP is
reserved if SC==false.
>
> >
> > > That sounds odd, and not sure why. Maybe Yi
> > > Liu or Yi Sun may know?
> >
> > Another interesting point is that, it looks to me after that commit
> > SNP is used for the domain that is not UNMANAGED even if PGSNP is not
> > set.
> >
> >
> > >
> > > >
> > > > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > > > the future we may consider to add SC support.
> > >
> > > Oh yes, I remembered the last time we discussed this. Could you remind me
> > > what's missing for us to support SC?
> >
> > Exactly what you described below.
> >
> > >
> > > IIUC, for common device emulations we can just declare SC==1, right?
> >
> > Strictly speaking, only safe for the datapath that is running in the
> > Qemu. For things like vhost-user, I'm not sure it can check CC when
> > using VFIO.
>
> Hmm yeah.. though I'll just call those vhost-user backends to fall into below
> "device assignment" category too. Great to know we're on the same page here.
I see.
>
> >
> > > As all
> > > the DMAs (including kernel accels like vhost) will be from host
> > > processors so
> > > there're no coherent issues with guest vcpu threads.
> > >
> > > If that's correct, the only challenge is device assignment in any form (I
> > > am
> > > not familiar with vdpa; so perhaps that includes vfio, vpda and any other
> > > kind
> > > of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> > > capability from the host iommu groups that covers the assigned devices,
> > > and we
> > > only set SC==1 if we have cache coherency on all the devices?
> >
> > For VFIO yes, and we should prevent VFIO without CC to be plugged if
> > SC is advertised.
> >
> > For vDPA, we don't need to worry about it at all, kernel vDPA forces
> > IOMMU_CACHE now.
> >
> > vhost_vdpa_alloc_domain():
> >
> > if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> > return -ENOTSUPP;
> >
> > (For device with on-chip IOMMU, it's the parent and device that
> > guarantees the CC)
>
> Ah right, yes you mentioned it and I forgot.. Though I'm not sure we'd simply
> double-check again here (if we'll support vfio anyway, then we'll need to be
> able to read those out from IOMMU groups), because we shouldn't rely on that
> fact which is an implementation detail of vdpa, imho (say, when vdpa starts to
> support !SC someday).
Good point, but at that time we need to expose the !SC via uAPI which
is currently not supported.
>
> PS: I have other comments below in previous reply - please have a look too!
> :-D
Sorry for missing that part :(
>
> >
> > Thanks
> >
> >
> > >
> > > >
> > > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > > ---
> > > > hw/i386/intel_iommu.c | 18 ++++++++++++------
> > > > hw/i386/intel_iommu_internal.h | 2 ++
> > > > 2 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index 294499ee20..3bcac56c3e 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -969,7 +969,8 @@ static dma_addr_t
> > > > vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> > > > static uint64_t vtd_spte_rsvd[5];
> > > > static uint64_t vtd_spte_rsvd_large[5];
> > > >
> > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > + uint64_t slpte, uint32_t level)
> > > > {
> > > > uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > >
> > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte,
> > > > uint32_t level)
> > > > rsvd_mask = vtd_spte_rsvd_large[level];
> > > > }
> > > >
> > > > + if (s->scalable_mode) {
> > > > + rsvd_mask &= ~VTD_SPTE_SNP;
> > > > + }
> > >
> > > IMHO what we want to do is only to skip the leaves of pgtables on SNP, so
> > > maybe
> > > we still want to keep checking the bit 11 reserved for e.g. common
> > > pgtable dir
> > > entries?
Maybe, but it's probably a question that can only be answered by
Intel. I can change it for the next version if you stick.
> > >
> > > To do so, how about directly modifying the vtd_spte_rsvd* fields in
> > > vtd_init()?
> > > I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
> > >
> > > - vtd_spte_rsvd[1] (4K)
> > > - vtd_spte_rsvd_large[2] (2M)
> > > - vtd_spte_rsvd_large[3] (1G)
> > >
> > > What do you think? Then we avoid passing IntelIOMMUState* all over too.
I started a version like that:), it should work, I will change that if
it was agreed by everyone.
The reason that I changed to pass IntelIOMMUState is that it results
in a smaller changeset. The reason is that I tend to introduce new
rsvd bits for SM mode since after checking vtd 3.3 it looks have
different reserved bit requirement than before (at least 1.2)
Thanks
>
> [Here]
>
> > >
> > > > +
> > > > return slpte & rsvd_mask;
> > > > }
> > > >
> > > > @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s,
> > > > VTDContextEntry *ce,
> > > > iova, level, slpte, is_write);
> > > > return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
> > > > }
> > > > - if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > > + if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > > > error_report_once("%s: detected splte reserve non-zero "
> > > > "iova=0x%" PRIx64 ", level=0x%" PRIx32
> > > > "slpte=0x%" PRIx64 ")", __func__, iova,
> > > > @@ -1185,7 +1190,8 @@ static int vtd_page_walk_one(IOMMUTLBEvent
> > > > *event, vtd_page_walk_info *info)
> > > > * @write: whether parent level has write permission
> > > > * @info: constant information for the page walk
> > > > */
> > > > -static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> > > > +static int vtd_page_walk_level(IntelIOMMUState *s,
> > > > + dma_addr_t addr, uint64_t start,
> > > > uint64_t end, uint32_t level, bool read,
> > > > bool write, vtd_page_walk_info *info)
> > > > {
> > > > @@ -1214,7 +1220,7 @@ static int vtd_page_walk_level(dma_addr_t addr,
> > > > uint64_t start,
> > > > goto next;
> > > > }
> > > >
> > > > - if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> > > > + if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
> > > > trace_vtd_page_walk_skip_reserve(iova, iova_next);
> > > > goto next;
> > > > }
> > > > @@ -1235,7 +1241,7 @@ static int vtd_page_walk_level(dma_addr_t addr,
> > > > uint64_t start,
> > > > * This is a valid PDE (or even bigger than PDE). We need
> > > > * to walk one further level.
> > > > */
> > > > - ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte,
> > > > info->aw),
> > > > + ret = vtd_page_walk_level(s, vtd_get_slpte_addr(slpte,
> > > > info->aw),
> > > > iova, MIN(iova_next, end), level
> > > > - 1,
> > > > read_cur, write_cur, info);
> > > > } else {
> > > > @@ -1294,7 +1300,7 @@ static int vtd_page_walk(IntelIOMMUState *s,
> > > > VTDContextEntry *ce,
> > > > end = vtd_iova_limit(s, ce, info->aw);
> > > > }
> > > >
> > > > - return vtd_page_walk_level(addr, start, end, level, true, true,
> > > > info);
> > > > + return vtd_page_walk_level(s, addr, start, end, level, true, true,
> > > > info);
> > > > }
> > > >
> > > > static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
> > > > diff --git a/hw/i386/intel_iommu_internal.h
> > > > b/hw/i386/intel_iommu_internal.h
> > > > index 3d5487fe2c..a6c788049b 100644
> > > > --- a/hw/i386/intel_iommu_internal.h
> > > > +++ b/hw/i386/intel_iommu_internal.h
> > > > @@ -388,6 +388,8 @@ typedef union VTDInvDesc VTDInvDesc;
> > > > #define VTD_INV_DESC_DEVICE_IOTLB_RSVD_LO 0xffff0000ffe0fff8
> > > >
> > > > /* Rsvd field masks for spte */
> > > > +#define VTD_SPTE_SNP 0x800ULL
> > > > +
> > > > #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, dt_supported) \
> > > > dt_supported ? \
> > > > (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM))
> > > > : \
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > --
> > > Peter Xu
> > >
> >
>
> --
> Peter Xu
>
- [PATCH] intel-iommu: ignore SNP bit in scalable mode, Jason Wang, 2021/11/24
- Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Peter Xu, 2021/11/24
- Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Jason Wang, 2021/11/24
- Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Peter Xu, 2021/11/24
- Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode,
Jason Wang <=
- Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Peter Xu, 2021/11/24
- Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Jason Wang, 2021/11/24
- Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Peter Xu, 2021/11/24
- Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Jason Wang, 2021/11/24
- RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Liu, Yi L, 2021/11/25
- RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Liu, Yi L, 2021/11/25
RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode, Liu, Yi L, 2021/11/24