[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [qemu patch V3 2/2] kvmclock: reduce kvmclock differenc
From: |
Eduardo Habkost |
Subject: |
Re: [Qemu-devel] [qemu patch V3 2/2] kvmclock: reduce kvmclock difference on migration |
Date: |
Mon, 5 Dec 2016 22:47:29 -0200 |
User-agent: |
Mutt/1.7.1 (2016-10-04) |
On Mon, Dec 05, 2016 at 03:20:16PM -0200, Marcelo Tosatti wrote:
> On Mon, Dec 05, 2016 at 01:17:42PM -0200, Eduardo Habkost wrote:
[...]
> > > >
> > > > +static void kvm_get_clock(KVMClockState *s)
> > > > +{
> > > > + struct kvm_clock_data data;
> > > > + int ret;
> > > > +
> > > > + ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> > > > + if (ret < 0) {
> > > > + fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> > > > + abort();
> > > > + }
> > > > + s->clock = data.clock;
> > > > + s->clock_is_reliable = kvm_has_adjust_clock_stable();
> > > > +}
> > > > +
> > >
> > > s->clock_is_reliable has to updated at migration, after
> > > the first time KVM_SET_CLOCK is called. There is no other
> > > place where it should be updated.
> >
> > I don't see why you claim that. We can simply update
> > s->clock_is_reliable the next time s->clock is updated. The only
> > code that reads s->clock checks s->clock_is_reliable first.
> > s->clock_is_reliable is a flag _about_ the value in s->clock. I
> > don't see why you want to update them at different times, if it
> > means more complex logic.
>
> Because the table of (wanted) behaviour dictates that.
My suggestion also follows the table strictly.
>
> > > This is the behaviour
> > > desired, according to this table:
> > >
> > > index:
> > > mig->c means use s->clock value from migration
> > > mig->m means read pvclock value from memory
> > > (both at migration, only the first time kvmclock_change_vmstate
> > > is called).
> > >
> > > runt->c means use KVM_GET_CLOCK/KVM_SET_CLOCK at runtime
> > > runt->m means read pvclock value from memory
> > > (both at runtime, after the first time kvmclock_change_vmstate
> > > is called).
> > >
> > > SOURCE DESTINATION BEHAVIOUR
> > > get_clock_reliable get_clock_reliable mig->c,runt->c 1
> > > get_clock_reliable ~get_clock_reliable mig->c,runt->m 2
> > > ~get_clock_reliable get_clock_reliable mig->m,runt->c 3
> > > ~get_clock_reliable ~get_clock_reliable mig->m,runt->m 4
> > >
> > > So looking at your patch below, and thinking of
> > > kvm_get_clock() updating s->clock_is_reliable from
> > > !running case (that is RUNNING -> ~RUNNING transition)
> > > makes no sense at all. That should not happen.
> >
> > Why not?
>
> Because we want the behaviour of the table above as follows:
>
> First execution of ~RUNNING -> RUNNING transition:
>
> * If source host supports reliable KVM_GET_CLOCK, then
> its not necessary to read from guest memory (because there
> is no danger of KVM_GET_CLOCK returning a value which will cause
> the guest to crash).
Correct. And this can be generalized to:
"If the host that set s->clock support reliable KVM_GET_CLOCK,
then its not necessary to read from guest memory [...]".
>
> * If source host does NOT support reliable KVM_GET_CLOCK, then
> it is necessary to read from guest memory (because there
> is danger of KVM_GET_CLOCK returning a value which will cause
> the guest to crash).
Correct. And this can be generalized to:
"If host that set s->clock do NOT support reliable KVM_GET_CLOCK,
then it is necessary to read from guest memory [...]"
>
> Second and subsequent executions of ~RUNNING -> RUNNING transition:
> (say if you pause the guest and continue it):
>
> * If host supports reliable KVM_GET_CLOCK, then
> its not necessary to read from guest memory (because there
> is no danger of KVM_GET_CLOCK returning a value which will cause
> the guest to crash).
>
Correct. And this can be generalized to:
"If the host that set s->clock support reliable KVM_GET_CLOCK,
then its not necessary to [...]".
> * If host does NOT support reliable KVM_GET_CLOCK, then
> it is necessary to read from guest memory (because there
> is danger of KVM_GET_CLOCK returning a value which will cause
> the guest to crash).
>
Correct. And this can be generalized to:
"If host that set s->clock do NOT support reliable KVM_GET_CLOCK,
then it is necessary to read from guest memory [...]"
> Does that make sense?
Yes. And settting s->clock_is_reliable and s->clock at the same
time guarantees that.
>
> > > s->clock_is_reliable should be updated after the first
> > > time ~RUNNING -> RUNNING migration is done, which
> > > is my patch does.
> >
> > I don't see why. All we need is that s->clock_is_reliable get
> > updated before any code try to read s->clock again.
>
> > > I think your s->clock,s->clock_is_reliable should be in sync
> > > suggestion comes from a misunderstanding of how they are used.
> > >
> > > s->clock is not read if s->clock_is_reliable is set.
> >
> > Did you mean "s->clock is not read if s->clock_is_reliable is
> > false"? You are right (except when kvmclock_current_nsec()==0?).
> > But this also means that the only code that reads s->clock also
> > checks s->clock_is_reliable first.
> >
> > >
> > > Look at my attached patch below (still untested, will test and then
> > > resubmit) and check whether it matches the behaviour
> > > described in the table above.
> >
> > It seems to match. But I don't see why my patch doesn't match the
> > table above.
>
> Its just fragile Eduardo: if anyone comes and changes the
> code as follows.
>
> Source: Does not support reliable KVM_GET_CLOCK.
> Destination: Does support reliable KVM_GET_CLOCK.
>
> * Migration from remote host that does not support
> reliable KVM_GET_CLOCK, to a host that supports it.
>
> * Some modification is done so that:
>
> 1) Incoming migration sets s->clock_is_reliable=false from source.
> 2) _Before_ vm_start(), new code decides to issue kvm_get_clock().
>
> Your code sets s->clock_reliable = true (because local host supports
> it), and failure. Do you see the problem?
The whole point of (my version of) kvm_get_clock() is to set
s->clock. If any code wants to change s->clock before vm_start(),
we are already screwed. I really don't think this is a realistic
scenario.
If you also want a function that returns KVM_GET_CLOCK but do not
change s->clock or s->clock_is_reliable, you can write one and I
won't complain.
>
> (*)
>
> > > > static void kvmclock_vm_state_change(void *opaque, int running,
> > > > RunState state)
> > > > {
> > > > @@ -91,16 +112,18 @@ static void kvmclock_vm_state_change(void *opaque,
> > > > int running,
> > > >
> > > > if (running) {
> > > > struct kvm_clock_data data = {};
> > > > - uint64_t time_at_migration = kvmclock_current_nsec(s);
> > > >
> > > > s->clock_valid = false;
> > > >
> > > > - /* We can't rely on the migrated clock value, just discard it
> > > > */
> > > > - if (time_at_migration) {
> > > > - s->clock = time_at_migration;
> > > > + /* if host that set s->clock did not support reliable
> > > > KVM_GET_CLOCK,
> > > > + * read kvmclock value from memory
> > > > + */
> > > > + if (!s->clock_is_reliable) {
> > > > + data.clock = kvmclock_current_nsec(s);
> > > > + } else {
> > > > + data.clock = s->clock;
> > > > }
> > > >
> > > > - data.clock = s->clock;
> > > > ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
> > > > if (ret < 0) {
> > > > fprintf(stderr, "KVM_SET_CLOCK failed: %s\n",
> > > > strerror(ret));
> > > > @@ -120,22 +143,23 @@ static void kvmclock_vm_state_change(void
> > > > *opaque, int running,
> > >
> > > > }
> > > > }
> > > > } else {
> > > > - struct kvm_clock_data data;
> > > > - int ret;
> > > >
> > > > + /*FIXME: is this really possible to happen?
> > > > + * Can kvmclock_vm_state_change(s, 0) be called twice without
> > > > + * a kvmclock_vm_state_change(s, 1) call in between?
> > > > + */
> > >
> > > It can't: check the code where notifiers are called.
> >
> > Yet another reason to simplify the code: if we don't need
> > different code paths for local stop/resume vs migration, we can
> > remove s->clock_valid completely.
>
> Yes it can be removed... But i'm just trying to fix a bug, not rewrite
> code while at it.
No need to remove it right now. But if the logic can make the
extra field unnecessary, that's another reason to simplify it. I
normally don't care about 2 or 3 extra lines of code, but in this
case I really think the complexity you are adding is unnecessary.
>
> > > > if (s->clock_valid) {
> > > > return;
> > > > }
> > > >
> > > > kvm_synchronize_all_tsc();
> > > >
> > > > - ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> > > > - if (ret < 0) {
> > > > - fprintf(stderr, "KVM_GET_CLOCK failed: %s\n",
> > > > strerror(ret));
> > > > - abort();
> > > > - }
> > > > - s->clock = data.clock;
> > > > -
> > > > + kvm_get_clock(s);
> > > > + /* FIXME: the comment below doesn't match what is done at
> > > > + * kvmclock_pre_save(). The comment must be updated, or
> > > > + * kvmclock_pre_save() must be changed. I don't know what's
> > > > + * the right thing to do.
> > > > + */
> >
> > We still need to address this comment. Your patch keeps the
> > comment that contradicts kvmclock_pre_save().
>
> The comment refers to the fact that you should not do:
>
> Event stop_vm()
> saved_clock1 = KVM_GET_CLOCK.
>
> Then again with the VM stopped, overwrite saved_clock:
> saved_clock2 = KVM_GET_CLOCK.
>
> You should always use saved_clock1. Thats what its about.
I am not sure I follow. pre_save runs after stop_vm() and
overwrites saved_clock1 (s->clock) with a new value, doesn't it?
So is it incorrect?
>
> Updated it.
>
> >
> > >
> > > This fails for all cases (1,2,3,4) in the table above.
> >
> > I can't see why it fails. The ordering of events from migration
> > followed by vm stop/resume should be:
> >
> > | Event | s->clock | s->clock_is_reliable
> > src | kvmclock_vm_state_change(0) | - | -
> > src | kvm_get_clock() | TIME-AT-VM-STOP |
> > CAP_ADJUST_CLOCK-on-SRC
> > src | kvmclock_pre_save() | TIME-AT-VM-STOP |
> > CAP_ADJUST_CLOCK-on-SRC
> > src | kvm_get_clock() | TIME-AT-PRE-SAVE |
> > CAP_ADJUST_CLOCK-on-SRC
> > --- | MIGRATION | TIME-AT-PRE-SAVE |
> > CAP_ADJUST_CLOCK-on-SRC
> > dst | kvmclock_vm_state_change(1) | TIME-AT-PRE-SAVE |
> > CAP_ADJUST_CLOCK-on-SRC
> > dst | KVM_SET_CLOCK | TIME-AT-PRE-SAVE |
> > CAP_ADJUST_CLOCK-on-SRC
> > dst | kvmclock_vm_state_change(0) | TIME-AT-PRE-SAVE |
> > CAP_ADJUST_CLOCK-on-SRC
> > dst | kvm_get_clock() | TIME-AT-VM-STOP |
> > CAP_ADJUST_CLOCK-on-DEST
> > dst | kvmclock_vm_state_change(1) | TIME-AT-VM-STOP |
> > CAP_ADJUST_CLOCK-on-DEST
> > dst | KVM_SET_CLOCK | TIME-AT-VM-STOP |
> > CAP_ADJUST_CLOCK-on-DEST
> >
> > Specific cases from your table are below. See the KVM_SET_CLOCK
> > rows.
> >
> > Case 1:
> >
> > | Event | s->clock | s->clock_is_reliable
> > src | kvmclock_vm_state_change(0) | - | -
> > src | kvm_get_clock() | TIME-AT-VM-STOP | 1
> > src | kvmclock_pre_save() | TIME-AT-VM-STOP | 1
> > src | kvm_get_clock() | TIME-AT-PRE-SAVE | 1
> > --- | MIGRATION | TIME-AT-PRE-SAVE | 1
> > dst | kvmclock_vm_state_change(1) | TIME-AT-PRE-SAVE | 1
> > dst | KVM_SET_CLOCK(s->clock) | TIME-AT-PRE-SAVE | 1
> > dst | kvmclock_vm_state_change(0) | TIME-AT-PRE-SAVE | 1
> > dst | kvm_get_clock() | TIME-AT-VM-STOP | 1
> > dst | kvmclock_vm_state_change(1) | TIME-AT-VM-STOP | 1
> > dst | KVM_SET_CLOCK(s->clock) | TIME-AT-VM-STOP | 1
> >
> > Case 2:
> >
> > | Event | s->clock | s->clock_is_reliable
> > src | kvmclock_vm_state_change(0) | - | -
> > src | kvm_get_clock() | TIME-AT-VM-STOP | 1
> > src | kvmclock_pre_save() | TIME-AT-VM-STOP | 1
> > src | kvm_get_clock() | TIME-AT-PRE-SAVE | 1
> > --- | MIGRATION | TIME-AT-PRE-SAVE | 1
> > dst | kvmclock_vm_state_change(1) | TIME-AT-PRE-SAVE | 1
> > dst | KVM_SET_CLOCK(s->clock) | TIME-AT-PRE-SAVE | 1
> > dst | kvmclock_vm_state_change(0) | TIME-AT-PRE-SAVE | 1
> > dst | kvm_get_clock() | TIME-AT-VM-STOP | 0
> > dst | kvmclock_vm_state_change(1) | TIME-AT-VM-STOP | 0
> > dst | KVM_SET_CLOCK(from_mem) | TIME-AT-VM-STOP | 0
> >
> > Case 3:
> >
> > | Event | s->clock | s->clock_is_reliable
> > src | kvmclock_vm_state_change(0) | - | -
> > src | kvm_get_clock() | TIME-AT-VM-STOP | 0
> > src | kvmclock_pre_save() | TIME-AT-VM-STOP | 0
> > src | kvm_get_clock() | TIME-AT-PRE-SAVE | 0
> > --- | MIGRATION | TIME-AT-PRE-SAVE | 0
> > dst | kvmclock_vm_state_change(1) | TIME-AT-PRE-SAVE | 0
> > dst | KVM_SET_CLOCK(from_mem) | TIME-AT-PRE-SAVE | 0
> > dst | kvmclock_vm_state_change(0) | TIME-AT-PRE-SAVE | 0
> > dst | kvm_get_clock() | TIME-AT-VM-STOP | 1
> > dst | kvmclock_vm_state_change(1) | TIME-AT-VM-STOP | 1
> > dst | KVM_SET_CLOCK(s->clock) | TIME-AT-VM-STOP | 1
> >
> > Case 4:
> >
> > | Event | s->clock | s->clock_is_reliable
> > src | kvmclock_vm_state_change(0) | - | -
> > src | kvm_get_clock() | TIME-AT-VM-STOP | 0
> > src | kvmclock_pre_save() | TIME-AT-VM-STOP | 0
> > src | kvm_get_clock() | TIME-AT-PRE-SAVE | 0
> > --- | MIGRATION | TIME-AT-PRE-SAVE | 0
> > dst | kvmclock_vm_state_change(1) | TIME-AT-PRE-SAVE | 0
> > dst | KVM_SET_CLOCK(from_mem) | TIME-AT-PRE-SAVE | 0
> > dst | kvmclock_vm_state_change(0) | TIME-AT-PRE-SAVE | 0
> > dst | kvm_get_clock() | TIME-AT-VM-STOP | 0
> > dst | kvmclock_vm_state_change(1) | TIME-AT-VM-STOP | 0
> > dst | KVM_SET_CLOCK(from_mem) | TIME-AT-VM-STOP | 0
> >
> >
> >
[...]
> > > static void kvmclock_vm_state_change(void *opaque, int running,
> > > RunState state)
> > > {
> > > @@ -91,15 +111,31 @@
> > >
> > > if (running) {
> > > struct kvm_clock_data data = {};
> > > - uint64_t time_at_migration = kvmclock_current_nsec(s);
> > > + uint64_t pvclock_via_mem = 0;
> > >
> > > - s->clock_valid = false;
> > > + /*
> > > + * If the host where s->clock was read did not support reliable
> > > + * KVM_GET_CLOCK, read kvmclock value from memory.
> > > + */
> > > + if (!s->clock_is_reliable) {
> > > + pvclock_via_mem = kvmclock_current_nsec(s);
> > > + }
> > >
> > > - /* We can't rely on the migrated clock value, just discard it */
> > > - if (time_at_migration) {
> > > - s->clock = time_at_migration;
> > > + /* migration/savevm/init restore
> > > + * update clock_is_reliable to match local
> > > + * host capabilities.
> > > + */
> > > + if (s->clock_valid == false) {
> > > + s->clock_is_reliable = kvm_has_adjust_clock_stable();
> > > }
> > >
> >
> > If we are reusing clock_valid for a different purpose, I suggest
> > we rename it to something clearer, like "clock_is_local" or
> > "clock_is_stopped".
> >
> > But I still don't see the need for this convoluted logic, if we
> > can simply set s->clock_is_reliable and s->clock at the same
> > time. This even allow us to remove s->clock_valid completely.
> >
> > > + /* We can't rely on the saved clock value, just discard it */
> > > + if (pvclock_via_mem) {
> > > + s->clock = pvclock_via_mem;
> > > + }
> > > +
> > > + s->clock_valid = false;
> > > +
> >
> > I suggest the following on top of your patch. It removes 27
> > unnecessary lines from the code.
> >
> > But, really, I'm tired of this discussion. If you want to keep
> > the complex logic you suggest and others are happy with it, go
> > ahead. You just won't get my Reviewed-by.
>
> I have been addressing all the comments you made so far Eduardo,
> and have a point at (*) which seems valid to me.
> This is the reason the patch is the way it is now.
>
> But sure, i will include your next patch in the series, test
> it, and if someone does kvm_get_clock() in between those two
> points in the code, then it'll break. I'll add a
> comment to that effect to warn users.
kvm_get_clock() changes s->clock too. If you set s->clock at the
wrong moment you'll have bigger problems than clock_is_reliable
being incorrect. But if your worry is kvm_get_clock(), then we
can just do the following. Would that be OK?
Then I can send a patch to remove clock_is_valid later, as a
follow-up.
(I have one additional comment about pre_save(), below)
Signed-off-by: Eduardo Habkost <address@hidden>
---
hw/i386/kvm/clock.c | 16 ++++------------
1 file changed, 4 insertions(+), 12 deletions(-)
diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index e179862..e2256a6 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -121,14 +121,6 @@ static void kvmclock_vm_state_change(void *opaque, int
running,
pvclock_via_mem = kvmclock_current_nsec(s);
}
- /* migration/savevm/init restore
- * update clock_is_reliable to match local
- * host capabilities.
- */
- if (s->clock_valid == false) {
- s->clock_is_reliable = kvm_has_adjust_clock_stable();
- }
-
/* We can't rely on the saved clock value, just discard it */
if (pvclock_via_mem) {
s->clock = pvclock_via_mem;
@@ -164,6 +156,10 @@ static void kvmclock_vm_state_change(void *opaque, int
running,
kvm_synchronize_all_tsc();
s->clock = kvm_get_clock();
+ /* any code that sets s->clock needs to ensure clock_is_reliable
+ * is correctly set.
+ */
+ s->clock_is_reliable = kvm_has_adjust_clock_stable();
/*
* If the VM is stopped, declare the clock state valid to
* avoid re-reading it on next vmsave (which would return
@@ -177,10 +173,6 @@ static void kvmclock_realize(DeviceState *dev, Error
**errp)
{
KVMClockState *s = KVM_CLOCK(dev);
- if (kvm_has_adjust_clock_stable()) {
- s->clock_is_reliable = true;
- }
-
qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
}
>
[...]
> > > +
> > > +static void kvmclock_pre_save(void *opaque)
> > > +{
> > > + KVMClockState *s = opaque;
> > > +
> >
> > We still need a comment here explaining why we need to re-read
> > the clock on pre_save if s->clock was already set on
> > kvmclock_vm_state_change().
>
> OK.
>
> > Also, what if we are migrating a VM that was already paused 10
> > minutes ago? Should we migrate the s->clock value from
> > 10 minutes ago, or the one read at pre_save?
>
> The one read at pre_save. I'll add a comment.
Is it really valid to make the clock move on an already-paused
VM, only because it was migrated?
>
> > > + s->clock = kvm_get_clock();
> > > +}
> > > +
--
Eduardo