Re: [PATCH v13 0/5] UFFD write-tracking migration/snapshots

qemu-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v13 0/5] UFFD write-tracking migration/snapshots

From:	David Hildenbrand
Subject:	Re: [PATCH v13 0/5] UFFD write-tracking migration/snapshots
Date:	Thu, 11 Feb 2021 21:44:07 +0100
> Am 11.02.2021 um 21:31 schrieb Peter Xu <peterx@redhat.com>:
> 
> On Thu, Feb 11, 2021 at 08:01:29PM +0100, David Hildenbrand wrote:
>>> On 11.02.21 19:28, Andrey Gruzdev wrote:
>>> On 11.02.2021 20:32, Peter Xu wrote:
>>>> On Thu, Feb 11, 2021 at 07:19:47PM +0300, Andrey Gruzdev wrote:
>>>>> On 09.02.2021 22:06, David Hildenbrand wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> just stumbled over this, quick question:
>>>>>>>> 
>>>>>>>> I recently played with UFFD_WP and notices that write protection is
>>>>>>>> only effective on pages/ranges that have already pages populated (IOW:
>>>>>>>> !pte_none() in the kernel).
>>>>>>>> 
>>>>>>>> In case memory was never populated (or was discarded using e.g.,
>>>>>>>> madvice(DONTNEED)), write-protection will be skipped silently and you
>>>>>>>> won't get WP events for applicable pages.
>>>>>>>> 
>>>>>>>> So if someone writes to a yet unpoupulated page ("zero"), you won't
>>>>>>>> get WP events.
>>>>>>>> 
>>>>>>>> I can spot that you do a single uffd_change_protection() on the whole
>>>>>>>> RAMBlock.
>>>>>>>> 
>>>>>>>> How are you handling that scenario, or why don't you have to handle
>>>>>>>> that scenario?
>>>>>>>> 
>>>>>>> Hi David,
>>>>>>> 
>>>>>>> I really wonder if such a problem exists.. If we are talking about a
>>>>>> I immediately ran into this issue with my simplest test cases. :)
>>>>>> 
>>>>>>> write to an unpopulated page, we should get first page fault on
>>>>>>> non-present page and populate it with protection bits from
>>>>>>> respective vma.
>>>>>>> For UFFD_WP vma's  page will be populated non-writable. So we'll get
>>>>>>> another page fault on present but read-only page and go to
>>>>>>> handle_userfault.
>>>>>> See the attached test program. Triggers for me on 5.11.0-rc6+ and
>>>>>> 5.9.13-200.fc33
>>>>>> 
>>>>>> gcc -lpthread uffdio_wp.c -o uffdio_wp
>>>>>> ./uffdio_wp
>>>>>> WP did not fire
>>>>>> 
>>>>>> Uncomment the placement of the zeropage just before registering to make
>>>>>> the WP actually trigger. If there is no PTE, there is nothing to
>>>>>> protect.
>>>>>> 
>>>>>> 
>>>>>> And it makes sense: How should the fault handler know which ranges you
>>>>>> wp-ed, if there is no place to store that information (the PTEs!). The
>>>>>> VMA cannot tell that story, it only knows that someone registered
>>>>>> UFFD_WP to selectively wp some parts.
>>>>>> 
>>>>>> You might have to register also for MISSING faults and place zero pages.
>>>>>> 
>>>>> Looked at the kernel code, agree that we miss WP events for unpopulated
>>>>> pages, UFFD_WP softbit won't be set in this case. But it doesn't make 
>>>>> saved
>>>>> snapshot inconsistent or introduce security issues. The only side effect 
>>>>> is
>>>>> that we may save updated page instead of zeroed, just increasing snapshot
>>>>> size. However this guest-physical page has never been touched from the 
>>>>> point
>>>>> of view of saved vCPU/device state and is not a concern.
>>>> Oh I just remembered one thing, that Linux should be zeroing pages when
>>>> allocating, so even if the page has legacy content it'll be cleared with
>>>> __GFP_ZERO allocations.  So yeah it would be harder to have issue at least 
>>>> with
>>>> a sensible OS.  I'm not sure about Windows or others, but it could be a 
>>>> common
>>>> case.  Then the only overhead is the extra pages we kept in the live 
>>>> snapshot,
>>>> which takes some more disk space.
>>>> 
>>>> Or there could be firmware running without OS at all, but it should really 
>>>> not
>>>> read unallocated pages assuming there must be zero.  It's not a sane 
>>>> behavior
>>>> even for a firmware.
>>>> 
>>>>> Often (at least on desktop Windows guests) only a small part of RAM has 
>>>>> ever
>>>>> been allocated by guest. Migration code needs to read each guest-physical
>>>>> page, so we'll have a lot of additional UFFD events, much more MISSING
>>>>> events then WP-faults.
>>>>> 
>>>>> And the main problem is that adding MISSING handler is impossible in 
>>>>> current
>>>>> single-threaded snapshot code. We'll get an immediate deadlock on 
>>>>> iterative
>>>>> page read.
>>>> Right.  We'll need to rework the design but just for saving a bunch of 
>>>> snapshot
>>>> image disk size.  So now I agree with you, let's keep this in mind, but 
>>>> maybe
>>>> it isn't worth a fix for now, at least until we figure something really 
>>>> broken.
>>>> 
>>>> Andrey, do you think we should still mention this issue into the todo list 
>>>> of
>>>> the wiki page of live snapshot?
>>>> 
>>>> Thanks,
>>>> 
>>> Yes, even if the page happens to be overwritten, it's overwritten by the 
>>> same VM so
>>> no security boundaries are crossed. And no machine code can assume that RAM 
>>> content
>>> is zeroed on power-on or reset so our snapshot state stays quite consistent.
>>> 
>>> Agree we should keep it in mind, but IMHO adding MISSING handler and 
>>> running separate
>>> thread would make performance worse.. So I doubt it's worth adding this to 
>>> TODO list..
>>> 
>> 
>> So, here is what happens: your snapshot will contain garbage at places where
>> it should contain zero.
>> 
>> This happens when your guest starts using an unpopulated page after
>> snapshotting started and the page has not been copied to the snapshot yet.
>> You won't get a WP event, therefore you cannot copy "zero" to the snapshot
>> before content gets overridden.
>> 
>> If you load your snapshot, it contains garbage at places that are supposed
>> to contain zero.
>> 
>> 
>> There is a feature in virtio-balloon that *depends* on previously discarded
>> pages from staying unmodified in some cases: free page reporting.
>> 
>> The guest will report pages (that might have been initialized to zero) to
>> the hypervisor). The hypervisor will discard page content if the content was
>> initialized to zero. After that, the guest is free to reuse the pages
>> anytime with the guarantee that content has not been modified (-> still is
>> zero).
>> 
>> 
>> See QEMU hw/virtio/virtio-balloon.c: virtio_balloon_handle_report()
>> 
>> "When we discard the page it has the effect of removing the page from the
>> hypervisor itself and causing it to be zeroed when it is returned to us. So
>> we must not discard the page [...] if the guest is expecting it to retain a
>> non-zero value."
>> 
>> And Linux drivers/virtio/virtio_balloon.c: virtballoon_validate()
>> 
>> "Inform the hypervisor that our pages are poisoned or initialized. If we
>> cannot do that then we should disable page reporting as it could potentially
>> change the contents of our free pages."
>> 
>> 
>> It's as simple as having a Linux guest with init_on_free and
>> free-page-reporting active via virtio-balloon.
>> 
>> Long story short, your feature might break guests (when continuing the
>> snapshot) when free page reporting is active. I agree that MISSING events
>> might be too heavy, so you should disable snapshots if free page reporting
>> is active.
> 
> When the page is reported with poison_val set, it is written already by the
> guest, right (either all zeros, or some special marks)?  If so, why it won't 
> be
> wr-protected by uffd?

Let‘s take a look at init-on-free.

The guest zeroes a page and puts it onto a buddy freelist. Free page reporting 
code takes it off that list and reports it to the hypervisor. The hypervisor 
discards the physical page and tells the guest he‘s done processing the page. 
The guest re-places the page onto the free page list.

>From that point on, the page can be re-allocated inside the guest and is 
>assumed to be zero. On access, a fresh (zeroed) page is populated by the 
>hypervisor. The guest won‘t re-zero the page, as it has the guarantee (from 
>free page reporting) that the page remained zero.

Write-protecting the unpopulated page won‘t work as discussed.

> 
> -- 
> Peter Xu
>
[Prev in Thread]
Current Thread
[Next in Thread]
Re: [PATCH v13 0/5] UFFD write-tracking migration/snapshots, (continued)
Prev by Date: Re: [PATCH 2/2] file-posix: Cache next hole
Next by Date: [PATCH v2 0/4] improve do_strtosz precision
Previous by thread: Re: [PATCH v13 0/5] UFFD write-tracking migration/snapshots
Next by thread: Re: [PATCH v13 0/5] UFFD write-tracking migration/snapshots
Index(es):
- Date
- Thread