From: Joao Martins
Subject: Re: [RFC 7/7] migration: call qemu_savevm_state_pending_exact() with the guest stopped
Date: Fri, 14 Oct 2022 11:36:51 +0100

On 13/10/2022 17:08, Juan Quintela wrote:
> Oops.  My understanding was that once the guest is stopped you can say
> how big it is.

It's worth keeping in mind that conceptually a VF won't stop (e.g. DMA) until it
is really told to through VFIO. So stopping the CPUs (guest) just means that
guest code does not arm the VF for more I/O, but that is still a weak guarantee:
the VF still has outstanding I/O to deal with until VFIO tells it to quiesce DMA
(for devices that support it).

> That is a deal breaker.  We can't continue if we don't
> know the size.  Copying the whole state to know the size looks too much.

It's fair to say that the kernel could know the size of the state once the VF is
stopped, but right now that only happens on the STOP -> STOP_COPY arc (which is
when the state is saved to kernel pages, regardless of userspace reading it).
And read() will block until the device has finished transferring it (unless you
do a non-blocking read, of course). Knowing the device state size ahead of time
would need to be reflected in the vfio UAPI, and needs that adjustment.
Providing the total length ahead of the device transfer might depend on the
hardware, but the current vfio vendor drivers look capable of it (from looking
at the code).

>> It would need a @data_size in addition to @data_fd in
>> vfio_device_feature_mig_state, or getting fseek supported over the fd
> Ok, we can decided how to do it, but I think that putting fseek into it
> will only make things more complicated.

fseek() was just a suggestion as a way to calculate the file length (device
state length) with current file APIs; on a raw fd it would really be lseek():

        off_t size = lseek(data_fd, 0, SEEK_END);
        lseek(data_fd, 0, SEEK_SET);  /* rewind before actually reading the state */

@data_size is likely better (or simpler), but it would need an extra field added
to the UAPI struct.

I am sure there are other alternatives.

>>> So we need to stop the vm "before" we
>>> cal the functions.
>>> It is a hack because:
>>> - we are "starting" the guest again to stop it in migration_complete()
>>>   I know, I know, but it is not trivial to get all the information
>>>   easily to migration_complete(), so this hack.
>> Could you expand on that? Naively, I was assuming that by 'all information' 
>> the
>> locally stored counters in migration_iteration_run() that aren't present in
>> MigrateState?
> This is not related to vfio at all.
>
> The problem is that current code assumes that the guest is still
> running, and then do qemu_mutex_lock_iothread() and unlock() inside the
> pending handlers.  To stop the guest, we need to lock the iothread
> first.  So this was going to hang.  I fixed it doing the lock/unlock
> twice.  That is the hack.  I will change the callers once that we decide
> what is the right path ahead.  This is not a problem related to vfio; it
> is a problem about how we can stop/resume guests programmatically
> in the middle of qemu code.
/me nods

>>> - auto_converge test fails with this hack.  I think that it is related
>>>   to previous problem.  We start the guest when it is supposed to be
>>>   stopped for convergence reasons.
>>> - All experiments that I did to do the proper thing failed with having
>>>   the iothread_locked() or try to unlock() it when not locked.
>>> - several of the pending functions are using the iothread lock
>>>   themselves, so I need to split it to have two versions (one for the
>>>   _estimate() case with the iothread lock), and another for the
>>>   _exact() case without the iothread_lock().  I want comments about
>>>   this approach before I try to continue on this direction.
>>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>>> ---
>>>  migration/migration.c        | 13 +++++++++++++
>>>  tests/qtest/migration-test.c |  3 ++-
>>>  2 files changed, 15 insertions(+), 1 deletion(-)
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 35e512887a..7374884818 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -3742,7 +3742,20 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>>      trace_migrate_pending_estimate(pending_size, s->threshold_size, pend_pre, pend_post);
>>>      if (pend_pre <= s->threshold_size) {
>>> +        int old_state = s->state;
>>> +        qemu_mutex_lock_iothread();
>>> +        // is this really necessary?  it works for me both ways.
>>> +        qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
>>> +        s->vm_was_running = runstate_is_running();
>>> +        vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>>> +        qemu_mutex_unlock_iothread();
>>>          qemu_savevm_state_pending_exact(&pend_pre, &pend_post);
>>> +        qemu_mutex_lock_iothread();
>>> +        runstate_set(old_state);
>>> +        if (s->vm_was_running) {
>>> +            vm_start();
>>> +        }
>>> +        qemu_mutex_unlock_iothread();
>> Couldn't we just have an extra patch that just stores pend_pre and pending_size
>> in MigrateState, which would allow all this check to be moved into
>> migration_completion(). Or maybe that wasn't an option for some other reason?
> This is not an option, because we don't have a way to get back from
> migration_completion() to migrate_iteration_run() if there is not enough
> space to send all the state.
>> Additionally what about having a migration helper function that the
>> vfio_save_complete_precopy() callback needs to use to check if the
>> expected device state size meets the threshold/downtime as it is saving the
>> device state, and otherwise fail earlier accordingly when saving beyond the
>> threshold?
> That is what I tried to do with the code.
> See patch 4. ram.c
So what I was saying earlier was to have something like a:

        migration_check_pending(ms, device_state_size)

Which would call into the migration core to check that the SLA is still met, but
be callable from the client (hw/vfio/migration) as opposed to the core calling
into the client. That way the client controls when to stop the VF.

The copy to userspace could probably be amortized via pread() at an arbitrary
offset reading 1 byte, to get an estimate of the size. Say, you could check
whether the state is still readable at a @threshold_size + 1 offset without
necessarily copying the whole thing to userspace. If that read succeeds you
would bail out earlier (failing the migration), and it would avoid doing the
full copy to userspace.

But the one gotcha is that the STOP -> STOP_COPY transition needs to happen
first, and that is what triggers the transfer of the device state into
kernel-managed pages, before userspace decides to do anything to access/copy
said state.
> Now I have two functions:
> - ram_state_pending_estimate()
> - ram_state_pending_exact()
> The 1st should work without any locking, i.e. just a best estimate without
> too much work; the second should give the real value.
Right -- I did notice that

> What we had discussed before for vfio is that the device will "cache"
> the value returned from previous ram_state_pending_exact().

A concern I see is that if we stop and resume the VF on a regular basis, just to
accommodate a calculation that is then made available throughout the whole
migration flow, it is going to be a little invasive from a guest
performance/behaviour perspective.

Perhaps this check ought to be amortized at major stage transitions of migration
(as opposed to every iteration), as this can end up happening very often as soon
as non-VFIO state + dirty pages get to a small enough threshold?

>> It would allow supporting both the (current UAPI) case where you need to
>> transfer the state to get the device state size (so checking pending_pre
>> against threshold_size constantly would allow us not to violate the SLA) as
>> well as any other UAPI improvement to fseek()/data_size that lets you fail
>> even earlier.
>> Seems like at least it keeps some of the rules (both under iothread lock) as
>> this patch does.
> Coming to the worst case, you need to read the state into a local buffer
> and then calculate the size in exact?  Looks overkill, but in your case,
> I can't see other way to do it.
> My understanding was that for new hardware you have an easy way to
> calculate this value "if the guest was stopped".

> My next series are a way to do in parallel the read/send of the state
> with respect to the migration_thread(), but that is a different
> problem. 

There's also non-blocking reads, though I am not sure that helps towards the
asynchronous read/send you mention.

> I need a way to calculate sizes to start with, otherwise I
> have no way to decide if I can enter the completion stage (or for old
> devices one can lie and assume that size is zero,

To be fair, that's what happens right now with the migration v1 protocol -- it's
not unprecedented, IIUC.

> but then you are going
> to have bigger downtimes).
> Later, Juan.
