[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON
From: |
Michael S. Tsirkin |
Subject: |
Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT |
Date: |
Wed, 7 Jul 2021 15:57:51 -0400 |
On Wed, Jul 07, 2021 at 09:47:31PM +0200, David Hildenbrand wrote:
> On 07.07.21 21:19, Michael S. Tsirkin wrote:
> > On Wed, Jul 07, 2021 at 09:14:00PM +0200, David Hildenbrand wrote:
> > > On 07.07.21 21:05, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 07, 2021 at 04:06:55PM +0200, David Hildenbrand wrote:
> > > > > Postcopy never worked properly with 'free-page-hint=on', as there are
> > > > > at least two issues:
> > > > >
> > > > > 1) With postcopy, the guest will never receive a
> > > > > VIRTIO_BALLOON_CMD_ID_DONE
> > > > > and consequently won't release free pages back to the OS once
> > > > > migration finishes.
> > > > >
> > > > > The issue is that for postcopy, we won't do a final bitmap sync
> > > > > while
> > > > > the guest is stopped on the source and
> > > > > virtio_balloon_free_page_hint_notify() will only call
> > > > > virtio_balloon_free_page_done() on the source during
> > > > > PRECOPY_NOTIFY_CLEANUP, after the VM state was already migrated
> > > > > to
> > > > > the destination.
> > > > >
> > > > > 2) Once the VM touches a page on the destination that has been
> > > > > excluded
> > > > > from migration on the source via qemu_guest_free_page_hint()
> > > > > while
> > > > > postcopy is active, that thread will stall until postcopy
> > > > > finishes
> > > > > and all threads are woken up. (with older Linux kernels that
> > > > > won't
> > > > > retry faults when woken up via userfaultfd, we might actually
> > > > > get a
> > > > > SEGFAULT)
> > > > >
> > > > > The issue is that the source will refuse to migrate any pages
> > > > > that
> > > > > are not marked as dirty in the dirty bmap -- for example,
> > > > > because the
> > > > > page might just have been sent. Consequently, the faulting
> > > > > thread will
> > > > > stall, waiting for the page to be migrated -- which could take
> > > > > quite
> > > > > a while and result in guest OS issues.
> > > >
> > > > OK so if source gets a request for a page which is not dirty
> > > > it does not respond immediately? Why not just teach it to
> > > > respond? It would seem that if destination wants a page we
> > > > should just give it to the destination ...
> > >
> > > The source does not know if a page has already been sent (e.g., via the
> > > background migration thread that moves all data over) vs. the page has not
> > > been send because the page was hinted. This is the part where we'd need
> > > additional tracking on the source to actually know that.
> > >
> > > We must not send a page twice, otherwise bad things can happen when
> > > placing
> > > pages that already have been migrated, because that scenario can easily
> > > happen with ordinary postcopy (page has already been sent and we're
> > > dealing
> > > with a stale request from the destination).
> >
> > OK let me get this straight
> >
> > A. source sends page
> > B. destination requests page
> > C. destination changes page
> > D. source sends page
> > E. destination overwrites page
> >
> > this is what you are worried about right?
>
> IIRC E. is with recent kernels:
>
> E. placing the page fails with -EEXIST and postcopy migration fails
>
> However, the man page (man ioctl_userfaultfd) doesn't describe what is
> actually supposed to happen when double-placing. Could be that it's
> "undefined behavior".
>
> I did not try, though.
>
>
> This is how it works today:
>
> A. source sends page and marks it clean
> B. destination requests page
> C. destination receives page and places it
> D. source ignores request as page is clean
If it's actually -EEXIST then we could just resend it
and teach destination to ignore -EEXIST errors right?
Will actually make things a bit more robust: destination
handles its own consistency instead of relying on source.
> >
> > the fix is to mark page clean in A.
> > then in D to not send page if it's clean?
> >
> > And the problem with hinting is this:
> >
> > A. page is marked clean
> > B. destination requests page
> > C. destination changes page
> > D. source sends page <- does not happen, page is clean!
> > E. destination overwrites page
>
> Simplified it's
>
> A. page is marked clean by hinting code
> B. destination requests page
> D. source ignores request as page is clean
> E. destination stalls until postcopy unregisters uffd
>
>
> Some thoughts
>
> 1. We do have a a recv bitmap where we track received pages on the
> destination (e.g., ramblock_recv_bitmap_test()), however we only use it to
> avoid sending duplicate requests to the hypervisor AFAIKs, and don't check
> it when placing pages.
>
> 2. Changing the migration behavior unconditionally on the source will break
> migration to old QEMU binaries that cannot handle this change.
We can always make this depend on new machine types.
> 3. I think the current behavior is in place to make debugging easier. If
> only a single instance of a page will ever be migrated from source to
> destination, there cannot be silent data corruption. Further, we avoid
> migrating unnecessarily pages twice.
>
Likely does not matter much for performance, it seems unlikely that
the race is all that common.
> Maybe Dave and Peter can spot any flaws in my understanding.
>
> --
> Thanks,
>
> David / dhildenb
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, (continued)
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, Michael S. Tsirkin, 2021/07/07
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, Peter Xu, 2021/07/07
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, Alexander Duyck, 2021/07/07
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, Peter Xu, 2021/07/07
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, David Hildenbrand, 2021/07/08
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, David Hildenbrand, 2021/07/08
Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, Michael S. Tsirkin, 2021/07/07
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, David Hildenbrand, 2021/07/07
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, Michael S. Tsirkin, 2021/07/07
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, David Hildenbrand, 2021/07/07
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT,
Michael S. Tsirkin <=
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, David Hildenbrand, 2021/07/08
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, Dr. David Alan Gilbert, 2021/07/08
- Re: [PATCH v1 2/2] virtio-balloon: disallow postcopy with VIRTIO_BALLOON_F_FREE_PAGE_HINT, Michael S. Tsirkin, 2021/07/09