[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM

From: Juan Quintela
Subject: Re: [Qemu-block] [PATCH] migration: flush the bdrv before stopping VM
Date: Wed, 18 Mar 2015 13:36:03 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4 (gnu/linux)

Kevin Wolf <address@hidden> wrote:
> [ Cc: qemu-block ]
> Am 18.03.2015 um 04:19 hat Li, Liang Z geschrieben:
>> > This needs further review/changes on the block layer.
>> > 
>> > First explanation, why I think this don't fix the full problem.
>> > Whith this patch, we fix the problem where we have a dirty block layer but
>> > basically nothing dirtying the memory on the guest (we are moving the 20
>> > seconds from max_downtime for the blocklayer flush), to 20 seconds until
>> > we have decided that the amount of dirty memory is small enough to be
>> > transferred during max_downtime.  But it is still going to take 20 seconds 
>> > to
>> > flush the block layer, and during that 20 seconds, the amount of memory 
>> > that
>> > can be dirty is HUGE.
>> It's true.
> What kind of cache is it actually that takes 20s to flush here?

That is a very good question.   When I meassured this (long, long ago),
testing with the same workload, bdrv_flush_all() could take form
100-300ms (what I expected and I can live with that), to several
seconds, what I can't live with.

Basically (this was around RHEL6 times, whatever upstream were there at
the time), my notes of the time point to:

        ret = select(max_fd, &rdfds, &wrfds, NULL, NULL);

notice that we are doing a select without a timeout in the iothread,

I know that the code has changed a lot on that area, the select() don't
exist anymore.  But this exemplifies the problem, something asks the
block layer to do an operation, and it blocks until it finishes, even if
this time it is taking more than usual for whatever reason.  What I
would really is a way to be able to bdrv_flush_all() to take a timeout
parameter and return an error case that it is taking too long.

> Inside of qemu, we don't cache a whole lot. If you use qcow2, we do use
> a writeback cache for some metadata that might need to be written back,
> but it is small and shouldn't take any significant time.
> Then we have the kernel page cache, or for some network protocols caches
> in the respective libs. This sounds like the right size for a 20s stall,
> but don't we require cache.direct=on for live migration anyway for
> coherency, i.e. bypassing any such cache?
> Lastly there may be a disk cache, but it's too small either.

>> > I think our ouptions are:
>> > 
>> > - tell the block layer at the beggining of migration
>> >   Hey, we are migrating, could you please start flusing data now, and
>> >   don't get the caches to grow too much, please, pretty please.
>> >   (I left the API to the block layer)
>> > - Add on that point a new function:
>> >   bdrvr_flush_all_start()
>> >   That starts the sending of pages, and we "hope" that by the time that
>> >   we have migrated all memory, they have also finished (so our last
>> >   call to block_flush_all() have less work to do)
>> > - Add another function:
>> >   int bdrv_flush_all_timeout(int timeout)
>> >   that returns if timeout pass, telling if it has migrated all pages or
>> >   timeout has passed.  So we can got back to the iterative stage if it
>> >   has taken too long.
> The problem is that the block layer really doesn't have an option to
> control what is getting synced once the data is cached outside of qemu.
> Essentially we can do an fdatasync() or we can leave it, that's the only
> choice we have.

See my explanation, if all that qemu is doing is an fdatasync(), just
spawn a thread that do the fdatasync() and if the thread don't finish in
<timeout> time, just return one error.  You can implement that behaviour
whatever you want.

> Now doing an asynchronous fdatasync() in the background is completely
> reasonable in order to reduce the amount of data to be flushed later.
> But the patch is doing it while holding both the BQL and the AIOContext
> lock of the device, which doesn't feel right. Maybe it should schedule a
> BH in the AIOContext instead and flush from there asynchronously.

Position is wrong, definitelly.  We want to start the asynchronous
fdatasync() at the start of migration, or each X milliseconds.  At this
point, we *think* that we can finish the migration on max_downtime
(basically we are ignoring the time that is going to take to migrate
device state and do the block layer flush, but normally this takes in
the other of 100-200ms, so it don't matter at all).

> The other thing is that flushing once doesn't mean that new data isn't
> accumulating in the cache, especially if you decide to do the background
> flush at the start of the migration.

But "pontentially", we would arrive to this point with less "cached"
data everything on the system.

> The obvious way to avoid that would be to switch to a writethrough mode,
> so any write request goes directly to the disk. This will, however,
> impact performance so heavily that it's probably not a realistic option.
> An alternative approach could be to repeat the background flush
> periodically, either time based or after every x bytes that are written
> to a device. Time based should actually be quite easy to implement.

We can do it periodically on the migration thread, if the call is
thread_safe.  We already have a loop there, and "kind" of time control,
what we miss is an interface.

> Once we have the flushing in the background, the migration code can
> apply any timeouts it wants. If the timeout is exceeded, the flush
> wouldn't be aborted but keep running in the background, but migration
> can go back to the iterative state anyway.

Yeap, that is what we really want/need.

>> > Notice that *normally* bdrv_flush_all() is very fast, the problem is that
>> > sometimes it get really, really slow (NFS decided to go slow, TCP drop a
>> > packet, whatever).
>> > 
>> > Right now, we don't have an interface to detect that cases and got back to
>> > the iterative stage.
>> How about go back to the iterative stage when detect that the pending_size 
>> is larger
>> Than max_size, like this:
>> +                /* do flush here is aimed to shorten the VM downtime,
>> +                 * bdrv_flush_all is a time consuming operation
>> +                 * when the guest has done some file writing */
>> +                bdrv_flush_all();
>> +                pending_size = qemu_savevm_state_pending(s->file, max_size);
>> +                if (pending_size && pending_size >= max_size) {
>> +                    qemu_mutex_unlock_iothread();
>> +                    continue;
>> +                }
>>                   ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>>                   if (ret >= 0) {
>>                       qemu_file_set_rate_limit(s->file, INT64_MAX);
>> and this is quite simple.
> Yes, but it is too simple. If you hold all the locks during
> bdrv_flush_all(), your VM will effectively stop as soon as it performs
> the next I/O access, so you don't win much. And you still don't have a
> timeout for cases where the flush takes really long.

This is probably better than what we had now (basically we are
"meassuring" after bdrv_flush_all how much the amount of dirty memory
has changed, and return to iterative stage if it took too much.  A
timeout would be better anyways.  And an interface te start the
synchronization sooner asynchronously would be also good.

Notice that my understanding is that any proper fix for this is 2.4 material.

Thanks, Juan.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]