Re: [Qemu-devel] Live migration without bdrv_drain_all()


From: Juan Quintela
Subject: Re: [Qemu-devel] Live migration without bdrv_drain_all()
Date: Wed, 28 Sep 2016 11:03:15 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux)

"Dr. David Alan Gilbert" <address@hidden> wrote:
> * Stefan Hajnoczi (address@hidden) wrote:
>> On Mon, Aug 29, 2016 at 06:56:42PM +0000, Felipe Franciosi wrote:
>> > Heya!
>> > 
>> > > On 29 Aug 2016, at 08:06, Stefan Hajnoczi <address@hidden> wrote:
>> > > 
>> > > At KVM Forum an interesting idea was proposed to avoid
>> > > bdrv_drain_all() during live migration.  Mike Cui and Felipe Franciosi
>> > > mentioned running at queue depth 1.  It needs more thought to make it
>> > > workable but I want to capture it here for discussion and to archive
>> > > it.
>> > > 
>> > > bdrv_drain_all() is synchronous and can cause VM downtime if I/O
>> > > requests hang.  We should find a better way of quiescing I/O that is
>> > > not synchronous.  Up until now I thought we should simply add a
>> > > timeout to bdrv_drain_all() so it can at least fail (and live
>> > > migration would fail) if I/O is stuck instead of hanging the VM.  But
>> > > the following approach is also interesting...
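
[As a rough illustration of the timeout idea: a drain that gives up
after a deadline instead of blocking forever.  The helper names below
are hypothetical stand-ins for QEMU internals, not actual API:]

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for QEMU internals. */
extern int64_t clock_now_ns(void);         /* monotonic clock */
extern bool all_requests_drained(void);    /* no I/O in flight? */
extern void poll_block_layer_once(void);   /* one event-loop step */

/* Return 0 once all requests complete, -ETIMEDOUT otherwise, so
 * migration can fail cleanly instead of hanging the VM. */
int drain_all_with_timeout(int64_t timeout_ns)
{
    int64_t deadline = clock_now_ns() + timeout_ns;

    while (!all_requests_drained()) {
        if (clock_now_ns() >= deadline) {
            return -ETIMEDOUT;
        }
        poll_block_layer_once();
    }
    return 0;
}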
>> > > 
>> > > During the iteration phase of live migration we could limit the queue
>> > > depth so that points with no I/O requests in flight can be identified.  At
>> > > these points the migration algorithm has the opportunity to move to
>> > > the next phase without requiring bdrv_drain_all() since no requests
>> > > are pending.
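
[A minimal sketch of that queue-depth gate, assuming a per-device
in-flight counter; illustrative only, not QEMU code:]

#include <stdbool.h>

typedef struct ThrottledQueue {
    int in_flight;
    int max_depth;      /* set to 1 while migration iterates */
} ThrottledQueue;

/* Called before submitting a guest request; refuse at the cap. */
static bool tq_try_submit(ThrottledQueue *q)
{
    if (q->in_flight >= q->max_depth) {
        return false;   /* leave the request in the virtqueue */
    }
    q->in_flight++;
    return true;
}

/* The completion callback decrements the counter. */
static void tq_complete(ThrottledQueue *q)
{
    q->in_flight--;
}

/* Migration may move to the next phase at such a point without
 * needing bdrv_drain_all(), since nothing is pending. */
static bool tq_quiesced(const ThrottledQueue *q)
{
    return q->in_flight == 0;
}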
>> > 
>> > I actually think that this "io quiesced state" is highly unlikely
>> > to _just_ happen on a busy guest. The main idea behind running at
>> > QD1 is to naturally throttle the guest and make it easier to
>> > "force quiesce" the VQs.
>> > 
>> > In other words, if the guest is busy and we run at QD1, I would
>> > expect the rings to be quite full of pending (i.e. unprocessed)
>> > requests. At the same time, I would expect that a call to
>> > bdrv_drain_all() (as part of do_vm_stop()) should complete much
>> > quicker.
>> > 
>> > Nevertheless, you mentioned that this is still problematic as that
>> > single outstanding IO could block, leaving the VM paused for
>> > longer.
>> > 
>> > My suggestion is therefore that we leave the vCPUs running, but
>> > stop picking up requests from the VQs. Provided nothing blocks,
>> > you should reach the "io quiesced state" fairly quickly. If you
>> > don't, then the VM is at least still running (despite seeing no
>> > progress on its VQs).
>> > 
>> > Thoughts on that?
>> 
>> If the guest experiences a hung disk it may enter error recovery.  QEMU
>> should avoid this so the guest doesn't remount file systems read-only.
>> 
>> This can be solved by only quiescing the disk for, say, 30 seconds at a
>> time.  If we don't reach a point where live migration can proceed during
>> those 30 seconds then the disk will service requests again temporarily
>> to avoid upsetting the guest.
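
[Sketching how those two ideas might combine: stop taking new requests
off the virtqueues while the vCPUs keep running, wait for the in-flight
count to reach zero, and give up after a bounded window so the guest
never sees a hung disk.  All helpers here are hypothetical:]

#include <stdbool.h>
#include <stdint.h>

extern int64_t clock_now_ns(void);           /* monotonic clock */
extern bool all_requests_drained(void);      /* no I/O in flight? */
extern void poll_block_layer_once(void);     /* one event-loop step */
extern void stop_popping_virtqueues(void);   /* vCPUs keep running */
extern void resume_popping_virtqueues(void);

/* Try to reach the quiesced state within window_ns (e.g. 30 s). */
bool try_quiesce_for(int64_t window_ns)
{
    int64_t deadline = clock_now_ns() + window_ns;

    stop_popping_virtqueues();
    while (clock_now_ns() < deadline) {
        if (all_requests_drained()) {
            return true;          /* migration may proceed */
        }
        poll_block_layer_once();
    }
    /* Window expired: service requests again so the guest's disk
     * timeouts do not fire and trigger error recovery. */
    resume_popping_virtqueues();
    return false;
}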
>> 
>> I wonder if Juan or David have any thoughts from the live migration
>> perspective?
>
> Throttling IO to reduce the time in the final drain makes sense
> to me, however:
>    a) It doesn't solve the problem if the IO device dies at just the
>       wrong time, so you can still get that hang in bdrv_drain_all().
>
>    b) Completely stopping guest IO sounds too drastic to me unless you can
>       time it to be just at the point before the end of migration; that feels
>       tricky to get right unless you can somehow tie it to an estimate of
>       remaining dirty RAM (that never works that well).
>
>    c) Something like a 30 second pause still feels too long; if that was
>       a big hairy database workload it would effectively be 30 seconds
>       of downtime.
>
> Dave

I think something like the proposed approach could work.

We can set queue depth = 1 or some such when we know we are near
completion of migration.  What we need then is a way to call the
equivalent of:

bdrv_drain_all(), but have it return EAGAIN or EBUSY if it is a bad
moment.  In that case, we just do another round over the whole memory,
or retry in X seconds.  Anything is good for us; we just need a way to
ask for the operation without it blocking.

Notice that migration is the equivalent of:

while (true) {
    write_some_dirty_pages();
    if (dirty_pages < threshold) {
        break;
    }
}
bdrv_drain_all();               /* synchronous: can hang right here */
write_rest_of_dirty_pages();

(Lots and lots of details omitted.)

What we really want is to issue the bdrv_drain_all() equivalent inside
the while loop, so that if there is any problem we just do another
cycle, no problem.
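
[A minimal sketch of that shape, assuming a hypothetical
bdrv_drain_all_nonblocking() that returns 0 when nothing is in flight
and -EAGAIN otherwise:]

#include <errno.h>

extern void write_some_dirty_pages(void);
extern void write_rest_of_dirty_pages(void);
extern long dirty_pages, threshold;
extern int bdrv_drain_all_nonblocking(void);    /* hypothetical */

void migrate_iterate(void)
{
    for (;;) {
        write_some_dirty_pages();
        if (dirty_pages >= threshold) {
            continue;           /* still too much dirty memory */
        }
        /* Near completion: only stop if the drain succeeds now. */
        if (bdrv_drain_all_nonblocking() == 0) {
            break;              /* quiesced, safe to finish */
        }
        /* -EAGAIN: a bad moment; do another round over memory. */
    }
    write_rest_of_dirty_pages();
}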

Later, Juan.


