Re: Question on dirty sync before kvm memslot removal

From: Peter Xu
Subject: Re: Question on dirty sync before kvm memslot removal
Date: Tue, 31 Mar 2020 12:51:33 -0400

On Tue, Mar 31, 2020 at 05:34:43PM +0200, Paolo Bonzini wrote:
> On 31/03/20 17:23, Peter Xu wrote:
> > Yeah, I used a new flag because I thought READONLY was a bit tricky to
> > be used directly here.  The thing is IIUC if guest writes to a
> > READONLY slot then KVM should either ignore the write or trigger an
> > error which I didn't check, however here what we want to do is to let
> > the write to fallback to the userspace so it's neither dropped (we
> > still want the written data to land gracefully on RAM), nor triggering
> > an error (because the slot is actually writable).
> No, writes fall back to userspace with KVM_MEM_READONLY.

I read that __kvm_write_guest_page() will return -EFAULT when writing
to a read-only memslot, and e.g. kvm_write_guest_virt_helper() will
then return X86EMUL_IO_NEEDED, which will be translated into an
EMULATION_OK in x86_emulate_insn().  Then x86_emulate_instruction()
seems to return "1" (note that I think it sets neither
vcpu->arch.pio.count nor vcpu->mmio_needed).  Does that mean it'll
retry the write forever instead of exiting to userspace?  I may have
misread somewhere, though...
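For reference, the path I'm reading can be condensed into a small
userspace model.  This is a hedged paraphrase of my reading, not the
kernel code: the struct, the KVM_HVA_ERR_RO_BAD value, and the helper
names are simplified stand-ins for what's in kvm_main.c.

```c
/* Userspace model of the write path described above.  Hypothetical
 * sketch: names mirror kvm_main.c but the types and values are
 * simplified stand-ins, not the real kernel definitions. */
#include <errno.h>
#include <stdbool.h>

#define KVM_HVA_ERR_RO_BAD (~0UL)   /* stand-in for the real error hva */

struct kvm_memory_slot {
    unsigned long userspace_addr;
    bool readonly;                  /* models KVM_MEM_READONLY */
};

/* Models gfn_to_hva_memslot() for a write access: a read-only slot
 * yields an error hva instead of a usable mapping. */
unsigned long gfn_to_hva_for_write(struct kvm_memory_slot *slot)
{
    if (slot->readonly)
        return KVM_HVA_ERR_RO_BAD;
    return slot->userspace_addr;
}

bool kvm_is_error_hva(unsigned long hva)
{
    return hva == KVM_HVA_ERR_RO_BAD;
}

/* Models __kvm_write_guest_page(): -EFAULT when the slot is
 * read-only, which is what kicks off the X86EMUL_IO_NEEDED path. */
int __kvm_write_guest_page(struct kvm_memory_slot *slot)
{
    unsigned long hva = gfn_to_hva_for_write(slot);

    if (kvm_is_error_hva(hva))
        return -EFAULT;
    /* the real code would copy the data to hva here */
    return 0;
}
```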

> >> The problem here is also that the GFN is not an unique identifier of the
> >> QEMU ram_addr_t.  However you don't really need to kick all vCPUs out,
> >> do you?  You can protect the dirty ring with its own per-vCPU mutex and
> >> harvest the pages from the main thread.
> > I'm not sure I get the point, but just to mention that currently the
> > dirty GFNs are collected in a standalone thread (in the QEMU series
> > it's called the reaper thread) rather than in the per vcpu thread
> > because the KVM_RESET_DIRTY_RINGS is per-vm after all.  One major
> > reason to kick the vcpus is to make sure the hardware cached dirty
> > GFNs (i.e. PML) are flushed synchronously.
> But you're referring to KVM kicking vCPUs not qemu_vcpu_kick.  Can you
> just do an iteration of reaping after setting KVM_MEM_READONLY?

Oh I think you're right: deleting & re-adding the memslot is special
here in that we can keep the data in the PML untouched, as long as it
can be flushed again to the new bitmap we're going to create.

However... I think I might have found another race with this:

          main thread                       vcpu thread
          -----------                       -----------
                                            dirty GFN1, cached in PML
          remove memslot1 of GFN1
            set slot READONLY (whatever, or INVALID)
            sync log (NOTE: no GFN1 yet)
                                            vmexit, flush PML with RCU
                                            (will flush to old bitmap) <------- [1]
            delete memslot1 (old bitmap freed)                         <------- [2]
          add memslot2 of GFN1 (memslot2 could be smaller)
            add memslot2

I'm not 100% sure, but I think GFN1's dirty bit will be lost: it is
correctly applied at [1], but the bitmap holding it is quickly freed
at [2].
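The interleaving can be modeled in a few lines of userspace C.  This
is a toy model, not QEMU/KVM code: all names are hypothetical, and it
only exists to show that when the PML flush at [1] lands in the old
bitmap after the last sync, freeing that bitmap at [2] loses the bit,
whereas flushing before the sync preserves it.

```c
/* Toy model of the race above -- hypothetical names, just to show why
 * the dirty bit is lost when the PML flush lands in the old bitmap
 * after the final sync but before the bitmap is freed. */
#include <stdbool.h>
#include <stdlib.h>

#define NPAGES 8

struct slot {
    bool *dirty_bitmap;            /* per-slot dirty log */
};

/* Hardware-cached dirty GFN (models the PML buffer); flushed lazily
 * on vmexit. */
static int pml_cached_gfn = -1;

static void vcpu_dirty_gfn(int gfn) { pml_cached_gfn = gfn; }

static void vmexit_flush_pml(struct slot *s)
{
    if (pml_cached_gfn >= 0 && s->dirty_bitmap)
        s->dirty_bitmap[pml_cached_gfn] = true;
    pml_cached_gfn = -1;
}

/* Sync harvests the slot's bitmap into the caller's log and clears it. */
static void sync_log(struct slot *s, bool *out)
{
    for (int i = 0; i < NPAGES; i++) {
        out[i] |= s->dirty_bitmap[i];
        s->dirty_bitmap[i] = false;
    }
}

/* Returns whether GFN1's dirty bit survives.  With flush_before_sync
 * false we get the racy order from the diagram: sync first, then the
 * PML flush hits the old bitmap [1], then the bitmap is freed [2]. */
bool run_interleaving(bool flush_before_sync)
{
    bool harvested[NPAGES] = { false };
    struct slot slot1 = { calloc(NPAGES, sizeof(bool)) };

    vcpu_dirty_gfn(1);                  /* dirty GFN1, cached in PML */

    if (flush_before_sync)
        vmexit_flush_pml(&slot1);       /* reap first: bit lands in time */
    sync_log(&slot1, harvested);        /* main thread syncs the log */
    if (!flush_before_sync)
        vmexit_flush_pml(&slot1);       /* [1] flush to old bitmap */

    free(slot1.dirty_bitmap);           /* [2] old bitmap freed */
    slot1.dirty_bitmap = NULL;

    return harvested[1];
}
```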

> > I think the whole kick operation is indeed too heavy for this when
> > with the run_on_cpu() trick, because the thing we want to know (pml
> > flushing) is actually per-vcpu and no BQL interaction. Do we have/need
> > a lightweight way to kick one vcpu in synchronous way?  I was
> > wondering maybe something like responding a "sync kick" request in the
> > vcpu thread right after KVM_RUN ends (when we don't have BQL yet).
> > Would that make sense?
> Not synchronously, because anything synchronous is very susceptible to
> deadlocks.

Yeah, it's easy to deadlock (I've suffered from it...), but besides
the above case (which I really think is special) I still think we
unluckily need a synchronous way.  For example, the VGA code needs the
latest dirty bit information to decide whether to update the screen
(or it could stall), and the migration code needs the current dirty
bit information to calculate the downtime, etc.
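To make the "sync kick" idea concrete, here is a rough pthread sketch
of the shape I have in mind.  None of these names exist in QEMU or
KVM; it's a hypothetical userspace model where the requester raises a
flag and waits, and the vcpu thread services the request right after
(simulated) KVM_RUN returns, without taking the BQL.

```c
/* Hypothetical sketch of a per-vcpu synchronous "sync kick".  All
 * names are made up for illustration; in real code the kick would
 * also have to force a vmexit so KVM_RUN actually returns. */
#include <pthread.h>
#include <stdbool.h>

struct vcpu {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool sync_requested;
    bool sync_done;
    bool stop;
    int flushed_gfns;              /* stand-in for "PML flushed" work */
};

/* Called by the vcpu thread right after KVM_RUN returns (no BQL). */
void vcpu_handle_sync_request(struct vcpu *v)
{
    pthread_mutex_lock(&v->lock);
    if (v->sync_requested) {
        v->flushed_gfns++;         /* flush hardware-cached dirty GFNs */
        v->sync_requested = false;
        v->sync_done = true;
        pthread_cond_broadcast(&v->cond);
    }
    pthread_mutex_unlock(&v->lock);
}

/* Models the vcpu thread's outer loop around KVM_RUN. */
void *vcpu_thread_fn(void *arg)
{
    struct vcpu *v = arg;

    for (;;) {
        pthread_mutex_lock(&v->lock);
        bool stop = v->stop;
        pthread_mutex_unlock(&v->lock);
        if (stop)
            break;
        /* ... KVM_RUN would block here until the forced vmexit ... */
        vcpu_handle_sync_request(v);
    }
    return NULL;
}

/* Called by the reaper/main thread; returns once the vcpu flushed. */
void vcpu_sync_kick(struct vcpu *v)
{
    pthread_mutex_lock(&v->lock);
    v->sync_requested = true;
    v->sync_done = false;
    /* real code: force a vmexit here so KVM_RUN returns promptly */
    while (!v->sync_done)
        pthread_cond_wait(&v->cond, &v->lock);
    pthread_mutex_unlock(&v->lock);
}
```

The point of the sketch is that the wait is per-vcpu and never touches
the BQL, which is what should keep it out of the usual deadlock traps.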

Peter Xu
