[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty

From: Kirti Wankhede
Subject: Re: [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.
Date: Tue, 24 Mar 2020 15:15:07 +0530
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0

On 3/24/2020 8:31 AM, Yan Zhao wrote:
On Tue, Mar 24, 2020 at 02:51:14AM +0800, Dr. David Alan Gilbert wrote:
* Alex Williamson (address@hidden) wrote:
On Mon, 23 Mar 2020 23:24:37 +0530
Kirti Wankhede <address@hidden> wrote:

On 3/21/2020 12:29 AM, Alex Williamson wrote:
On Sat, 21 Mar 2020 00:12:04 +0530
Kirti Wankhede <address@hidden> wrote:
On 3/20/2020 11:31 PM, Alex Williamson wrote:
On Fri, 20 Mar 2020 23:19:14 +0530
Kirti Wankhede <address@hidden> wrote:
On 3/20/2020 4:27 AM, Alex Williamson wrote:
On Fri, 20 Mar 2020 01:46:41 +0530
Kirti Wankhede <address@hidden> wrote:

+static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
+                                 size_t size, uint64_t pgsize,
+                                 u64 __user *bitmap)
+       struct vfio_dma *dma;
+       unsigned long pgshift = __ffs(pgsize);
+       unsigned int npages, bitmap_size;
+       dma = vfio_find_dma(iommu, iova, 1);
+       if (!dma)
+               return -EINVAL;
+       if (dma->iova != iova || dma->size != size)
+               return -EINVAL;
+       npages = dma->size >> pgshift;
+       bitmap_size = DIRTY_BITMAP_BYTES(npages);
+       /* mark all pages dirty if all pages are pinned and mapped. */
+       if (dma->iommu_mapped)
+               bitmap_set(dma->bitmap, 0, npages);
+       if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size))
+               return -EFAULT;

We still need to reset the bitmap here, clearing and re-adding the
pages that are still pinned.


I thought you agreed on my reply to it
> Why re-populate when there will be no change since
    > vfio_iova_dirty_bitmap() is called holding iommu->lock? If there is any
    > pin request while vfio_iova_dirty_bitmap() is still working, it will
    > wait till iommu->lock is released. Bitmap will be populated when page is
    > pinned.

As coded, dirty bits are only ever set in the bitmap, never cleared.
If a page is unpinned between iterations of the user recording the
dirty bitmap, it should be marked dirty in the iteration immediately
after the unpinning and not marked dirty in the following iteration.
That doesn't happen here.  We're reporting cumulative dirty pages since
logging was enabled, we need to be reporting dirty pages since the user
last retrieved the dirty bitmap.  The bitmap should be cleared and
currently pinned pages re-added after copying to the user.  Thanks,

Does that mean, we have to track every iteration? do we really need that

Generally the flow is:
- vendor driver pin x pages
- Enter pre-copy-phase where vCPUs are running - user starts dirty pages
tracking, then user asks dirty bitmap, x pages reported dirty by
- In pre-copy phase, vendor driver pins y more pages, now bitmap
consists of x+y bits set
- In pre-copy phase, vendor driver unpins z pages, but bitmap is not
updated, so again bitmap consists of x+y bits set.
- Enter in stop-and-copy phase, vCPUs are stopped, mdev devices are stopped
- user asks dirty bitmap - Since here vCPU and mdev devices are stopped,
pages should not get dirty by guest driver or the physical device.
Hence, x+y dirty pages would be reported.

I don't think we need to track every iteration of bitmap reporting.

Yes, once a bitmap is read, it's reset.  In your example, after
unpinning z pages the user should still see a bitmap with x+y pages,
but once they've read that bitmap, the next bitmap should be x+y-z.
Userspace can make decisions about when to switch from pre-copy to
stop-and-copy based on convergence, ie. the slope of the line recording
dirty pages per iteration.  The implementation here never allows an
inflection point, dirty pages reported through vfio would always either
be flat or climbing.  There might also be a case that an iommu backed
device could start pinning pages during the course of a migration, how
would the bitmap ever revert from fully populated to only tracking the
pinned pages?  Thanks,

At KVM forum we discussed this - if guest driver pins say 1024 pages
before migration starts, during pre-copy phase device can dirty 0 pages
in best case and 1024 pages in worst case. In that case, user will
transfer content of 1024 pages during pre-copy phase and in
stop-and-copy phase also, that will be pages will be copied twice. So we
decided to only get dirty pages bitmap at stop-and-copy phase. If user
is going to get dirty pages in stop-and-copy phase only, then that will
be single iteration.
There aren't any devices yet that can track sys memory dirty pages. So
we can go ahead with this patch and support for dirty pages tracking
during pre-copy phase can be added later when there will be consumers of
that functionality.

So if I understand this right, you're expecting the dirty bitmap to
accumulate dirty bits, in perpetuity, so that the user can only
retrieve them once at the end of migration?  But if that's the case,
the user could simply choose to not retrieve the bitmap until the end
of migration, the result would be the same.  What we have here is that
dirty bits are never cleared, regardless of whether the user has seen
them, which is wrong.  Sorry, we had a lot of discussions at KVM forum,
I don't recall this specific one 5 months later and maybe we weren't
considering all aspects.  I see the behavior we have here as incorrect,
but it also seems relatively trivial to make correct.  I hope the QEMU
code isn't making us go through all this trouble to report a dirty
bitmap that gets thrown away because it expects the final one to be
cumulative since the beginning of dirty logging.  Thanks,

I remember the discussion that we couldn't track the system memory
dirtying with current hardware; so the question then is just to track

hi Dave
there are already devices that are able to track the system memory,
through two ways:
(1) software method. like VFs for "Intel(R) Ethernet Controller XL710 Family
(2) hardware method. through hardware internal buffer (as one Intel
internal hardware not yet to public, but very soon) or through VTD-3.0

we have already had code verified using the two ways to track system memory
in fine-grained level.

what has been pinned and then ideally put that memory off until the end.
(Which is interesting because I don't think we currently have  a way
to delay RAM pages till the end in qemu).

I think the problem here is that we mixed pinned pages with dirty pages.
yes, pinned pages for mdev devices are continuously likely to be dirty
until device stopped.
But for devices that are able to report dirty pages, dirtied pages
will be marked again if hardware writes them later.

So, is it good to introduce a capability to let vfio/qemu know how to
treat the dirty pages?
(1) for devices have no fine-grained dirty page tracking capability
   a. pinned pages are regarded as dirty pages. they are not cleared by
   dirty page query
   b. unpinned pages are regarded as dirty pages. they are cleared by
   dirty page query or UNMAP ioctl.
(2) for devices that have fine-grained dirty page tracking capability
    a. pinned/unpinned pages are not regarded as dirty pages
    b. only pages they reported are regarded as dirty pages and are to be
    cleared by dirty page query and UNMAP ioctl.
(3) for dirty pages marking APIs, like vfio_dma_rw()...
    pages marked by them are regared as dirty and are to be cleared by
    dirty page query and UNMAP ioctl

For (1), qemu VFIO only reports dirty page amount and would not transfer
those pages until last round.

(1) and (3) can be implemented now. I don't think QEMU have support to delay certain marked pages dirty as of now. If QEMU queries dirty pages bitmap in pre-copy phase, all pinned pages reported as dirty will be transferred in pre-copy and again in stop-and-copy phase.

for (2) and (3), qemu VFIO should report and transfer them in each

(2) can be added later


[I still worry whether migration will be usable with any
significant amount of system ram that's pinned in this way; the
downside will very easily get above the threshold that people like]

yes. that's why we have to do multi-round dirty page query and
transfer and clear the dirty bitmaps in each round for devices that are
able to track in fine grain.
and that's why we have to report the amount of dirty pages before
stop-and-copy phase for mdev devices, so that people are able to know
the real downtime as much as possible.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]