qemu-devel

From: David Hildenbrand
Subject: Re: [PATCH RFC] memory: pause all vCPUs for the duration of memory transactions
Date: Tue, 27 Oct 2020 14:08:48 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0

On 27.10.20 14:02, Vitaly Kuznetsov wrote:
David Hildenbrand <david@redhat.com> writes:

On 27.10.20 13:36, Vitaly Kuznetsov wrote:
David Hildenbrand <david@redhat.com> writes:

On 26.10.20 11:43, David Hildenbrand wrote:
On 26.10.20 09:49, Vitaly Kuznetsov wrote:
Currently, KVM doesn't provide an API to make atomic updates to memmap when
the change touches more than one memory slot, e.g. in case we'd like to
punch a hole in an existing slot.

Reports are that multi-CPU Q35 VMs booted with OVMF sometimes print something
like

!!!! X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID - 00000003 !!!!
ExceptionData - 0000000000000010  I:1 R:0 U:0 W:0 P:0 PK:0 SS:0 SGX:0
RIP  - 000000007E35FAB6, CS  - 0000000000000038, RFLAGS - 0000000000010006
RAX  - 0000000000000000, RCX - 000000007E3598F2, RDX - 00000000078BFBFF
...

The problem seems to be that TSEG manipulations on one vCPU are not atomic
from other vCPUs' views. In particular, here's the strace:

Initial creation of the 'problematic' slot:

10085 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, 
guest_phys_addr=0x100000,
     memory_size=2146435072, userspace_addr=0x7fb89bf00000}) = 0

... and then the update (caused by e.g. mch_update_smram()) later:

10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, 
guest_phys_addr=0x100000,
     memory_size=0, userspace_addr=0x7fb89bf00000}) = 0
10090 ioctl(13, KVM_SET_USER_MEMORY_REGION, {slot=6, flags=0, 
guest_phys_addr=0x100000,
     memory_size=2129657856, userspace_addr=0x7fb89bf00000}) = 0

In case KVM has to handle any event on a different vCPU in between these
two calls, the #PF will get triggered.

An ideal solution to the problem would probably require KVM to provide a
new API to do the whole transaction in one shot but as a band-aid we can
just pause all vCPUs to make memory transactions atomic.

Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC: Generally, memmap updates happen only a few times during guest boot but
I'm not sure there are no scenarios when pausing all vCPUs is undesirable
from a performance point of view. Also, I'm not sure if the kvm_enabled()
check is needed.
---
 softmmu/memory.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/softmmu/memory.c b/softmmu/memory.c
index fa280a19f7f7..0bf6f3f6d5dc 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -28,6 +28,7 @@
 #include "exec/memory-internal.h"
 #include "exec/ram_addr.h"
+#include "sysemu/cpus.h"
 #include "sysemu/kvm.h"
 #include "sysemu/runstate.h"
 #include "sysemu/tcg.h"
@@ -1057,7 +1058,9 @@ static void address_space_update_topology(AddressSpace *as)
 void memory_region_transaction_begin(void)
 {
     qemu_flush_coalesced_mmio_buffer();
-    ++memory_region_transaction_depth;
+    if ((++memory_region_transaction_depth == 1) && kvm_enabled()) {
+        pause_all_vcpus();
+    }
 }
 
 void memory_region_transaction_commit(void)
@@ -1087,7 +1090,11 @@ void memory_region_transaction_commit(void)
             }
             ioeventfd_update_pending = false;
         }
-   }
+
+        if (kvm_enabled()) {
+            resume_all_vcpus();
+        }
+    }
 }
 
 static void memory_region_destructor_none(MemoryRegion *mr)


This is in general unsafe. pause_all_vcpus() will temporarily drop the
BQL, resulting in bad things happening to caller sites.

Oh, I see, thanks! I was expecting there's a reason we don't have this
simple fix in already :-)


I studied the involved issues quite intensively when wanting to resize
memory regions from the virtio-mem code. It's not that easy.

Have a look at my RFC for resizing. You can apply something similar to
other operations.

https://www.mail-archive.com/qemu-devel@nongnu.org/msg684979.html

Oh, and I even mentioned the case you try to fix here back then

"
Instead of inhibiting during the region_resize(), we could inhibit for the
whole memory transaction (from begin() to commit()). This could be nice,
because also splitting of memory regions would be atomic (I remember there
was a BUG report regarding that), however, I am not sure if that might
impact any RT users.
"

The current patches live in
https://github.com/davidhildenbrand/qemu/commits/virtio-mem-next

Especially

https://github.com/davidhildenbrand/qemu/commit/433fbb3abed20f15030e42f2b2bea7e6b9a15180



I'm not sure why we're focusing on ioctls here. I was debugging my case
quite some time ago but from what I remember it had nothing to do with
ioctls from QEMU. When we are removing a memslot any exit to KVM may
trigger an error condition as we'll see that vCPU or some of our
internal structures (e.g. VMCS for a nested guest) references
non-existent memory. I don't see a good solution other than making the
update fully atomic from *all* vCPUs' point of view, and this requires
stopping all vCPUs -- either from QEMU or from KVM.

I cannot follow. My patch waits until *any* KVM ioctls are out of the
kernel. That includes VCPUs, but also other ioctls (because there are
some that require a consistent memory block state).

So from a KVM point of view, the CPUs are stopped.

Sorry for not being clear: your patch looks good to me. What I tried to
say is that with the current KVM API the only way to guarantee atomicity
of the update is to make vCPUs stop (one way or another). Kicking them
out and preventing new ioctls from being dispatched is one way
(temporarily pausing them inside KVM would be another, for example -- but
that would require a *new* API supplying the whole transaction and not one
memslot update).

Ah, got it.

Yes - and I briefly looked into resizing slots inside KVM atomically and it already turned out to be a major pain. All that metadata that's allocated for a memory slot based on the size is problematic.

Same applies to all other kinds of operations (splitting, punching out, ...) as you also mentioned.

--
Thanks,

David / dhildenb



