From: Huang, Kai
Subject: Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest
Date: Tue, 7 Jun 2016 17:21:06 +1200
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.1

On 6/7/2016 3:58 PM, Alex Williamson wrote:
On Tue, 7 Jun 2016 11:20:32 +0800
Peter Xu <address@hidden> wrote:

On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:
On Mon, 6 Jun 2016 21:43:17 +0800
Peter Xu <address@hidden> wrote:

On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
On Mon, 6 Jun 2016 13:04:07 +0800
Peter Xu <address@hidden> wrote:
Besides the reason that there might be guests that do not support
CM=1, are there performance considerations? When the user's
configuration does not require the CM capability (e.g., a generic VM
configuration, without VFIO), shall we allow the user to disable the CM
bit so that we get better IOMMU performance (avoiding extra and
useless invalidations)?

With Alexey's proposed patch to have callback ops when the iommu
notifier list adds its first entry and removes its last, any of the
additional overhead to generate notifies when nobody is listening can
be avoided.  These same callbacks would be the ones that need to
generate a hw_error if a notifier is added while running in CM=0.
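
The callback idea described above can be sketched roughly as follows. This is only an illustration of the pattern (one hook when the notifier list gets its first entry, one when it loses its last), not Alexey's patch or QEMU's actual API. Every name here (ToyIommu, ToyIommuOps, toy_notifier_add and so on) is hypothetical, and the failure return stands in for the hw_error mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not QEMU's real interface. */
typedef struct ToyIommu ToyIommu;

typedef struct ToyIommuOps {
    /* Called when the notifier list goes from empty to non-empty. */
    bool (*notify_started)(ToyIommu *iommu);
    /* Called when the notifier list goes from non-empty to empty. */
    void (*notify_stopped)(ToyIommu *iommu);
} ToyIommuOps;

struct ToyIommu {
    const ToyIommuOps *ops;
    bool caching_mode;    /* CM bit exposed to the guest */
    bool notify_enabled;  /* emulation generates notifies only when set */
    size_t nr_notifiers;
};

static bool toy_vtd_notify_started(ToyIommu *iommu)
{
    if (!iommu->caching_mode) {
        /* With CM=0 the guest never reports new mappings, so a listener
         * cannot work. A real device model would raise a fatal error. */
        return false;
    }
    iommu->notify_enabled = true;
    return true;
}

static void toy_vtd_notify_stopped(ToyIommu *iommu)
{
    /* Nobody is listening any more: stop paying for notifications. */
    iommu->notify_enabled = false;
}

static const ToyIommuOps toy_vtd_ops = {
    .notify_started = toy_vtd_notify_started,
    .notify_stopped = toy_vtd_notify_stopped,
};

/* Register one listener; returns false if the IOMMU refuses it. */
static bool toy_notifier_add(ToyIommu *iommu)
{
    if (iommu->nr_notifiers == 0 && !iommu->ops->notify_started(iommu)) {
        return false;
    }
    iommu->nr_notifiers++;
    return true;
}

/* Unregister one listener; the last removal triggers notify_stopped. */
static void toy_notifier_remove(ToyIommu *iommu)
{
    assert(iommu->nr_notifiers > 0);
    if (--iommu->nr_notifiers == 0) {
        iommu->ops->notify_stopped(iommu);
    }
}
```

With this shape, a configuration without any listener never sets notify_enabled, so the emulation path can skip notification work entirely.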

Not familiar with Alexey's patch


Thanks for the pointer. :)

, but is that for VFIO only?

vfio is currently the only user of the iommu notifier, but the
interface is generic, which is how it should (must) be.


I mean, if
we configure CM=1, the guest kernel will send an invalidation request
every time it creates new entries (context entries or IOTLB
entries). Even without VFIO notifiers, the guest needs to trap into QEMU
to process the invalidation requests. This is avoidable if we are not
using VFIO devices at all (so no need to maintain any mappings),

CM=1 only defines that not-present and invalid entries can be cached;
any change to an existing entry requires an invalidation regardless of
CM.  What you're looking for sounds more like ECAP.C:

Yes, but I guess what I was talking about is the CM bit, not ECAP.C.
When we clear or replace a context entry, the guest kernel will definitely
send a context entry invalidation to QEMU:

static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 
        if (!iommu)

        clear_context_table(iommu, bus, devfn);
        iommu->flush.flush_context(iommu, 0, 0, 0,
        iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);

... whereas if we are creating a new one (like attaching a new VFIO
device?), it's optional behavior, depending on whether the CM bit is set:

static int domain_context_mapping_one(struct dmar_domain *domain,
                                      struct intel_iommu *iommu,
                                      u8 bus, u8 devfn)
         * It's a non-present to present mapping. If hardware doesn't cache
         * non-present entry we only need to flush the write-buffer. If the
         * _does_ cache non-present entries, then it does so in the special
         * domain #0, which we have to flush:
        if (cap_caching_mode(iommu->cap)) {
                iommu->flush.flush_context(iommu, 0,
                                           (((u16)bus) << 8) | devfn,
                iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
        } else {

Only if cap_caching_mode() is set (which is bit 7, the CM bit) will we
send these invalidations. What I meant is that we should allow the
user to specify the CM bit, so that when we are not using VFIO
devices, we can skip the above flush_context() and flush_iotlb()
calls. So, besides the fact that some guests do not support the
CM bit (like Jailhouse), performance might be another reason
to allow users to specify the CM bit themselves.
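
To make the cost difference concrete, here is a minimal sketch of the behavior described above. It is a toy model, not kernel or QEMU code: the toy_* names are invented for illustration, and the only facts taken from the thread are that CM is bit 7 of the capability register and that the guest issues a context invalidation plus an IOTLB invalidation. The base capability value used by a caller would be arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Caching Mode is bit 7 of the capability register. */
#define TOY_CAP_CM (1ULL << 7)

/* Hypothetical helper: build the guest-visible capability value with the
 * CM bit chosen by the user instead of being hard-wired. */
static uint64_t toy_build_cap(uint64_t base_cap, bool caching_mode)
{
    return caching_mode ? (base_cap | TOY_CAP_CM) : (base_cap & ~TOY_CAP_CM);
}

/* Same test the guest driver performs via cap_caching_mode(). */
static bool toy_cap_caching_mode(uint64_t cap)
{
    return (cap >> 7) & 1;
}

/* Invalidation requests (each one a trap into the emulator) issued for a
 * non-present to present context mapping. */
static unsigned toy_invalidations_for_new_mapping(uint64_t cap)
{
    unsigned requests = 0;

    if (toy_cap_caching_mode(cap)) {
        requests++;  /* context-cache invalidation */
        requests++;  /* domain-selective IOTLB invalidation */
    }
    /* With CM=0 only a write-buffer flush is needed, which this model
     * does not count as an invalidation request. */
    return requests;
}

/* Clearing an existing entry always needs both invalidations,
 * regardless of CM. */
static unsigned toy_invalidations_for_clear(uint64_t cap)
{
    (void)cap;
    return 2;
}
```

In this model a guest that sees CM=0 issues no invalidations when creating a mapping, while the cost of tearing one down is unchanged.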

I'm dubious of this, IOMMU drivers are already aware that hardware
flushes are expensive and do batching to optimize it.  The queued
invalidation mechanism itself is meant to allow asynchronous
invalidations.  QEMU invalidating a virtual IOMMU might very well be
faster than hardware.

Batching doesn't mean we can eliminate the IOTLB flushes for non-present to present mappings when CM=1, whereas with CM=0 those IOTLB flushes are not necessary, as the code above shows. Therefore, generally speaking, CM=0 should perform better than CM=1, even for QEMU's vIOMMU.

In my understanding, the purpose of exposing CM=1 is to force the guest to do an IOTLB flush for each mapping change (including non-present to present) so that QEMU is able to emulate each mapping change from the guest (correct me if I am wrong). If that holds, CM=1 is really a workaround for making vfio assigned devices and the vIOMMU work together, and unfortunately it cannot work for other vendors' IOMMUs that have no CM bit, such as AMD's IOMMU.

So what are the requirements for making vfio assigned devices and the vIOMMU work together? I think it would be more helpful to implement a more generic solution to monitor and emulate the guest vIOMMU's page tables, rather than simply exposing CM=1 to the guest, as that only works on the Intel IOMMU.

And what do you mean by asynchronous invalidations? I think the IOVA of a changed mapping cannot be used until the mapping is invalidated. It doesn't matter whether the invalidation is done via QI or via registers.


C: Page-walk Coherency
  This field indicates if hardware access to the root, context,
  extended-context and interrupt-remap tables, and second-level paging
  structures for requests-without PASID, are coherent (snooped) or not.
    • 0: Indicates hardware accesses to remapping structures are non-coherent.
    • 1: Indicates hardware accesses to remapping structures are coherent.
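
As a rough illustration of what the C field means for a driver, the sketch below models a page-table update that is followed by a CPU cache flush only when the hardware reports non-coherent page walks. It assumes C is bit 0 of the extended capability register. The toy_* names and the flush counter are hypothetical stand-ins, and no real cache flush instruction is issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: Page-walk Coherency is bit 0 of the extended capability
 * register. */
#define TOY_ECAP_C (1ULL << 0)

static bool toy_ecap_coherent(uint64_t ecap)
{
    return (ecap & TOY_ECAP_C) != 0;
}

/* Hypothetical bookkeeping: counts the cache-line flushes a real driver
 * would have to issue. */
typedef struct ToyFlushStats {
    unsigned cache_flushes;
} ToyFlushStats;

/* Write one page-table entry. When hardware page walks are not coherent
 * (C=0), the CPU cache line holding the entry must be flushed so that
 * the IOMMU sees the new value in memory. */
static void toy_write_pte(uint64_t ecap, uint64_t *pte, uint64_t value,
                          ToyFlushStats *stats)
{
    *pte = value;
    if (!toy_ecap_coherent(ecap)) {
        stats->cache_flushes++;  /* stands in for a clflush of the entry */
    }
}
```

Note that this flush concerns CPU caches only. It is separate from the IOTLB and context-cache invalidations governed by CM.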

Without both CM=0 and C=0, our only virtualization mechanism for
maintaining a hardware cache coherent with the guest view of the iommu
would be to shadow all of the VT-d structures.  For purely emulated
devices, maybe we can get away with that, but I doubt the current
ghashes used for the iotlb are prepared for it.

Actually I hadn't noticed this bit before. I see that it decides
whether the guest kernel needs to issue explicit clflush() calls when
modifying IOMMU PTEs, but shouldn't we always flush the memory cache so
that we can be sure the IOMMU sees the same memory data as the CPU does?

I think it would be a question of how much the g_hash code really buys
us in the VT-d code, it might be faster to do a lookup each time if it
means fewer flushes.  Those hashes are useless overhead for assigned
devices, so maybe we can avoid them when we only have assigned
devices ;)  Thanks,

