From: Radim Krčmář
Subject: Re: [Qemu-devel] [QEMU PATCH] kvmclock: advance clock by time window between vm_stop and pre_save
Date: Fri, 4 Nov 2016 21:07:03 +0100

2016-11-04 16:29-0200, Marcelo Tosatti:
> On Fri, Nov 04, 2016 at 06:34:20PM +0100, Radim Krčmář wrote:
>> 2016-11-04 14:24-0200, Marcelo Tosatti:
>> > On Fri, Nov 04, 2016 at 04:48:28PM +0100, Radim Krčmář wrote:
>> >> 2016-11-04 16:33+0100, Paolo Bonzini:
>> >> > On 04/11/2016 16:25, Radim Krčmář wrote:
>> >> >>> >  
>> >> >>> > +        if (s->advance_clock && s->clock + s->advance_clock > s->clock) {
>> >> >>> > +            s->clock += s->advance_clock;
>> >> >>> > +            s->advance_clock = 0;
>> >> >>> > +        }
>> >> >> Can't the advance_clock be added to the migrated KVMClockState instead
>> >> >> of being passed as another parameter?
>> >> >> 
>> >> >> (It is sad that we can't just query KVMClockState in kvmclock_pre_save
>> >> >>  because of the Linux bug.)
>> >> > 
>> >> > What Linux bug?  The one that makes us use kvmclock_current_nsec?
>> >> 
>> >> No, the one that forced Marcelo to add the 10-minute limit to the
>> >> advance_clock. 
>> > 
>> > Overflow when reading clocksource, in the timer interrupt.
>> 
>> Is it still there?  I wonder why 10 minutes is the limit. :)
> 
> Second question: I don't know, I guess it started as a
> "let's clamp this to values that make sense to catch
> any potential offenders", but now I think it makes
> sense to say:
> 
> "because it does not make sense to search for the smallest
> overflow of Linux guests and use that".
> 
> So 10 minutes works for me, using the sentence:
> "migration should not take longer than 10 minutes".
> 
> You prefer to drop that 10 minutes check?

I would.  The overflow should mean that the value will be off, but it
would already be, so being a little more wrong doesn't seem worth the
extra code.
It is minor and v2 has r-b, so let's just forget about it. :)

> First question:
> 
> I suppose so: disable CLOCK_MONOTONIC counting on pause, 
> pause your VM for two days, and resume.
> 
> timekeeping_resume:
> 
>         cycle_now = tk->tkr_mono.read(clock);
>         if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
>                 cycle_now > tk->tkr_mono.cycle_last) {
>                 u64 num, max = ULLONG_MAX;
>                 u32 mult = clock->mult;
>                 u32 shift = clock->shift;
>                 s64 nsec = 0;
> 
>                 cycle_delta = clocksource_delta(cycle_now,
>                                                 tk->tkr_mono.cycle_last,
>                                                 tk->tkr_mono.mask);
> 
>                 /*
>                  * "cycle_delta * mult" may cause 64 bits overflow, if the
>                  * suspended time is too long. In that case we need to do
>                  * the 64 bits math carefully
>                  */
>                 do_div(max, mult);
>                 if (cycle_delta > max) {
>                         num = div64_u64(cycle_delta, max);
>                         nsec = (((u64) max * mult) >> shift) * num;
>                         cycle_delta -= num * max;
>                 }
>                 nsec += ((u64) cycle_delta * mult) >> shift;

This seems like it handles the overflow.  And it does so tragically
inefficiently, so I can understand why they didn't want to do that in
interrupt context.  A 128-bit multiplication (on x86_64) and a shift
would get it done in a fraction of the time.
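
Something along these lines (an untested sketch, assuming compiler
__int128 support, i.e. x86_64 with GCC/Clang):

        #include <stdint.h>

        /* Untested sketch: the do_div() loop above collapses into a single
         * 128-bit multiply plus shift; compilers lower this to one mul
         * instruction on x86_64.  (The result can still truncate for
         * absurdly long suspends, but so can the original's.) */
        static inline uint64_t cycles_to_nsec(uint64_t cycle_delta,
                                              uint32_t mult, uint32_t shift)
        {
                return (uint64_t)(((unsigned __int128)cycle_delta * mult)
                                  >> shift);
        }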

[...]
> For example.

I found the 10 minutes in __clocksource_update_freq_scale():

        if (freq) {
                /*
                 * Calc the maximum number of seconds which we can run before
                 * wrapping around. For clocksources which have a mask > 32-bit
                 * we need to limit the max sleep time to have a good
                 * conversion precision. 10 minutes is still a reasonable
                 * amount. That results in a shift value of 24 for a
                 * clocksource with mask >= 40-bit and f >= 4GHz. That maps to
                 * ~ 0.06ppm granularity for NTP.
                 */
                sec = cs->mask;
                do_div(sec, freq);
                do_div(sec, scale);
                if (!sec)
                        sec = 1;
                else if (sec > 600 && cs->mask > UINT_MAX)
                        sec = 600;

                clocks_calc_mult_shift(&cs->mult, &cs->shift, freq,
                                       NSEC_PER_SEC / scale, sec * scale);
        }
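
Plugging in a TSC-like clocksource shows where the clamp kicks in
(standalone sketch with scale == 1, my numbers, not from the kernel):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t mask = UINT64_MAX;     /* TSC: full 64-bit counter */
                uint64_t freq = 4000000000ULL;  /* 4 GHz */
                uint64_t sec  = mask / freq;    /* ~4.6e9 s before a wrap */

                if (sec > 600 && mask > UINT32_MAX)
                        sec = 600;  /* clamped only for mult/shift precision */

                printf("max sleep used for mult/shift: %llu s\n",
                       (unsigned long long)sec);
                return 0;
        }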

>> >>  We wouldn't need this advance_clock hack if we could
>> >> just call KVM_GET_CLOCK like we did before 00f4d64ee76e ("kvmclock:
>> >> clock should count only if vm is running").
>> > 
>> > People have requested that CLOCK_MONOTONIC stop when the
>> > VM suspends.
>> 
>> Ugh, we could have done that on the OS level, 
> 
> Done it how at the OS level ?

With another steal-time-like interface.

>> but stopping kvmclock
>> means that we sabotaged CLOCK_BOOTTIME as it doesn't return
>> CLOCK_MONOTONIC + time spent in suspend anymore.
> 
> It does:
> So this patchset introduces CLOCK_BOOTTIME, which is identical
> to CLOCK_MONOTONIC, but includes any time spent in suspend.
> 
> "suspend" refers to suspend-to-RAM, not "virtual machine stopped".
> 
> CLOCK_BOOTTIME fails to work when you extend "suspend" 
> to "suspend AND virtual machine pause or restoration".
> 
> (if you want that, it can be fixed easily I suppose, but
> nobody ever asked for it).

Yeah, it's just weird, because the reason why people wanted
CLOCK_MONOTONIC to stop was quite stupid (to see how long the VM has run
unpaused ...) while having a usable timesource seems quite important.

>> And kvmclock is defined to count nanoseconds (slightly different in
>> practice) since some point in time, so pausing it is not really valid.
> 
> Sorry, where is that defined? Nobody makes that assumption
> (that kvmclock should be correlated to some physical time
> and that it can't stop counting).

Documentation/virtual/kvm/msr.txt

  system_time: a host notion of monotonic time, including sleep
  time at the time this structure was last updated. Unit is
  nanoseconds.

and arch/x86/include/asm/pvclock-abi.h:

  * pvclock_vcpu_time_info holds the system time and the tsc timestamp
  * of the last update. So the guest can use the tsc delta to get a
  * more precise system time.  There is one per virtual cpu.
  *
  * pvclock_wall_clock references the point in time when the system
  * time was zero (usually boot time), thus the guest calculates the
  * current wall clock by adding the system time.
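
For reference, a guest-side read per this ABI boils down to something
like the following (simplified sketch; rdtsc()/rmb() helpers assumed,
and the real code multiplies in 128 bits and also checks the flags
field):

        static uint64_t pvclock_read(const volatile struct pvclock_vcpu_time_info *pv)
        {
                uint32_t version;
                uint64_t delta, time;

                do {
                        version = pv->version;  /* odd while the host updates */
                        rmb();
                        delta = rdtsc() - pv->tsc_timestamp;
                        if (pv->tsc_shift >= 0)
                                delta <<= pv->tsc_shift;
                        else
                                delta >>= -pv->tsc_shift;
                        /* sketch only: should be a 128-bit multiply */
                        time = pv->system_time +
                               ((delta * pv->tsc_to_system_mul) >> 32);
                        rmb();
                } while ((version & 1) || version != pv->version);

                return time;
        }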

>> >> > It should work with 4.9-rc (well, once Linus applies my pull request).
>> >> > 4.9-rc will not return ktime_get_ns for KVM_GET_CLOCK; it will return
>> >> > the raw value from the kernel timekeeper.
>> >> > 
>> >> > I'm thinking that we should add a KVM capability for this, and skip
>> >> > kvmclock_current_nsec if the capability is present.  The first part is
>> >> > trivial, so we can do it even during Linux rc period.
>> >> 
>> >> I agree, that sounds like a nice improvement.
>> > 
>> > I fail to see how this (KVM_GET_CLOCK using the raw value vs the
>> > NTP-corrected one) relates to clocksource overflow and CLOCK_MONOTONIC
>> > stop requests.
>> 
>> If the guest uses master clock, then kvmclock drifts away from BOOTTIME,
>> so returning the value of kvmclock using kernel_ns() would introduce a
>> drift when setting the clock -- we must read the same way that the guest
>> does.
> 
> Does not work:
> 
>     (periodic incremental reads from the TSC, using the offset from the
>      previous TSC read, storing the scaled value to a memory location,
>      with reads interpolating via the TSC -- what the kernel does)
> 
>         != (different values than)
> 
>     (TSC scaled to nanoseconds)
> 
> It's not obvious to me why this should be the case; perhaps if done
> carefully it can work.

This can't be equal if the TSC frequency is not constant, and it is not:
the KVM master clock uses a different TSC frequency than the kernel.
The current state is weird, because if we ever recompute the master
clock, we use kvm_get_time_and_clockread() for the new base, which
shifts the time. :/

We could match CLOCK_BOOTTIME with kvmclock if we wanted ...  It would
be much saner, but we'd be recomputing the master clock every time the
host time gets updated.

> (See the thread with Roman, this was determined empirically).
> 
>> > -----
>> > 
>> > Todo list (collective???):
>> > 
>> > 1) Implement suggestion to Roman: 
>> > "Subject: Re: [PATCH] x86/kvm: fix condition to update kvm master
>> > clocks" to handle the time backwards when updating masterclock 
>> > from kernel clock.
>> > 
>> > 2) Review TSC scaling support (make sure scaled TSC 
>> > is propagated properly everywhere).
>> > 
>> > 3) Enable migration with invariant TSC.
>> > 
>> > 4) Fullvirt fix for local APIC deadline timer bug
>> 
>> The list looks good.
>> 
>> I'll resume work on (4) now that rework of APIC timers is merged.
>> It's going to be ugly on the guest side, because Linux could be using it
>> even when not using kvmclock for sched clock ...
> 
> From previous discussion on the thread Eduardo started:
|> 
|>>> > Actually, a nicer fix would be to check the different
|>>> > frequencies and scale the deadline relative to the difference.
|>>>
|>>> You cannot know what exactly the guest was thinking when it set the
|>>> TSC
|>>> deadline.  Perhaps it wanted an interrupt when the TSC was exactly
|>>> 9876543210.
|> 
|> Yep, the spec just says that the timer fires when TSC >= deadline.
|> 
|>> You can't expect the correlation between TSC value and timer interrupt
|>> execution to be precise, because of the delay between HW timer
|>> expiration and interrupt execution.
|> 
|> Yes, this is valid.
|> 
|>> So you have to live with the fact that the TSC deadline timer can be
|>> late (which is the same thing as with your paravirt solution, in case
|>> of migration to host with faster TSC freq) (which to me renders the
|>> argument above invalid).
|>>
|>> Solution for old guests and new guests:
|>> Just save how far ahead in the future the TSC deadline timer is
|>> supposed
|>> to expire on the source (in ns), in the destination save all registers
|>> (but disable TSC deadline timer injection), arm a timer in QEMU
|>> for expiration time, reenable TSC deadline timer reinjection.
|> 
|> The interrupt can also fire early after migrating to a TSC with lower
|> frequency, which violates the definition of TSC deadline timer when an
|> OS could even RDTSC a value lower than the deadline in the interrupt
|> handler.  (An OS doesn't need to expect that.)
|> 
|> We already have vcpu->arch.virtual_tsc_khz that ought to remain constant
|> for a lifetime of the guest and KVM uses it to convert TSC delta into
|> correct nanoseconds.
|> 
|> The main problem is that QEMU changes virtual_tsc_khz when migrating
|> without hardware scaling, so KVM is forced to get nanoseconds wrong ...
|> 
|> If QEMU doesn't want to keep the TSC frequency constant, then it would
|> be better if it didn't expose TSC in CPUID -- guest would just use
|> kvmclock without being tempted by direct TSC accesses.
|> 
|> 
>> > (BTW Paolo, Windows is also vulnerable, right?)
>> 
>> Ugh, we can't do much more than disabling TSC deadline there until QEMU
>> can migrate it.
> 
> So the idea was to convert to nanoseconds, count a timer
> in nanoseconds, and when it expires inject immediately.
> (Yes, it's ugly.)

Problematic, because we can't know that the timer was used to count real
time: maybe the guest really wanted to count TSC cycles.  And it is just
wrong to fire the timer while the guest could still rdtsc() a value
below the deadline in its handler, which might happen if we migrated to
a TSC with a lower frequency.
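
(The arithmetic itself would be trivial, e.g. the sketch below using the
constant virtual TSC frequency in kHz that KVM already tracks -- the
objection is about semantics, not computation.)

        #include <stdint.h>

        /* Sketch: guest TSC-deadline delta -> nanoseconds, given the virtual
         * TSC frequency in kHz; rounding and overflow are hand-waved. */
        static uint64_t tsc_delta_to_ns(uint64_t delta_tsc, uint32_t tsc_khz)
        {
                return (uint64_t)(((unsigned __int128)delta_tsc * 1000000ULL)
                                  / tsc_khz);
        }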

> (BTW, I suppose you can talk to the destination via the command interface
> in the postcopy patches now, say to get the destination TSC frequency.)

The problem is that changing the TSC frequency on the host affects all
subsequent interrupts, because KVM is not informed of the new frequency.
The one timer armed during migration is relatively insignificant, so
Paolo's patches even ignored it.


