
[Qemu-devel] Live migration debugging


From: Paul Boven
Subject: [Qemu-devel] Live migration debugging
Date: Tue, 29 Jul 2014 13:31:46 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0

Hi folks,

Recently there have been several patches to fix kvmclock issues during migration, which were subsequently reverted. I hope the observations below can be helpful in pinning down the actual issues, so that live migration works again in the future.

Live migration has been broken since at least release 1.4.0 (as shipped with Ubuntu 13.04), and still has the same problems in 2.1.0-rc2, but briefly worked in 2.0-git-20140609.

The problem is that once the live migration is complete and the guest is started on the destination server, it hangs, consuming 100% CPU. The hang can last mere seconds, but I have also observed hangs as long as 11 minutes. Then the guest suddenly starts to respond again as if nothing had happened, but its clock has not progressed at all while the machine was hanging.

What I have observed is that the time spent hanging is exactly the amount by which the host's clock has drifted from 'real' (NTP) time since the previous migration. If you multiply the time since the previous migration by the PPM offset as determined by NTP (see /var/lib/ntp/ntp.drift), you get exactly the number of seconds the guest spends at 100% CPU before becoming responsive again. I have observed this on two different pairs of KVM servers; each of the servers has a negative PPM value according to NTP.
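Put as a formula:

  freeze (s) ≈ (time since previous migration, in s) × |PPM offset| × 1e-6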

Example: a guest with nearly 9 days of uptime, on a host with (according to NTP) a clock rate of -34 ppm, froze for 27 seconds when I migrated it. I have done quite a few test migrations, and this relationship holds quite precisely.
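Checking the numbers: 9 days ≈ 777,600 s, and 777,600 s × 34 × 1e-6 ≈ 26.4 s, which matches the observed 27-second freeze.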

As the duration of the freeze is proportional to the time since the previous migration, debugging is a bit difficult: you have to wait a while before you can demonstrate the problem. This is probably also why the problem is underreported: the freeze is hardly noticeable if you migrate right after starting the VM, but looks like a complete crash after a few months of uptime.

With the 2.0 sources from 2014-06-09, the problem does *not* occur. A side-effect of the patch in that snapshot is that the guest clock has a lot of jitter until the first migration, but it behaves normally, without hangs, on subsequent migrations.

Is there a way to read the kvmclock directly, from the guest or the host, so that we can compare the two before and after a migration and see precisely what goes wrong?
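In the meantime, here is a minimal user-space sketch (my own rough approach, assuming a Linux guest that uses kvm-clock as its current clocksource; check /sys/devices/system/clocksource/clocksource0/current_clocksource) that at least makes the freeze, and any wall-clock step, visible from inside the guest. Both clocks read the kvm-clock clocksource in this setup, so a kvmclock stall shows up as a gap between successive lines, while a step in the wall clock shows up as a change in the printed offset:

/* clockwatch.c - print CLOCK_REALTIME vs. CLOCK_MONOTONIC_RAW once a
 * second. A migration freeze shows up as a gap between successive
 * lines; a wall-clock step shows up as a jump in the printed offset.
 * Build: gcc -o clockwatch clockwatch.c   (add -lrt on older glibc)
 */
#define _GNU_SOURCE             /* for CLOCK_MONOTONIC_RAW */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double ts_to_sec(const struct timespec *ts)
{
    return ts->tv_sec + ts->tv_nsec / 1e9;
}

int main(void)
{
    for (;;) {
        struct timespec rt, raw;

        clock_gettime(CLOCK_REALTIME, &rt);
        clock_gettime(CLOCK_MONOTONIC_RAW, &raw);
        printf("real=%.6f raw=%.6f offset=%.6f\n",
               ts_to_sec(&rt), ts_to_sec(&raw),
               ts_to_sec(&rt) - ts_to_sec(&raw));
        fflush(stdout);
        sleep(1);
    }
}

Leave it running across a migration and compare the log around that moment: a hole in the timestamps is the freeze, a changed offset is a wall-clock step.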

See also https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1297218

Regards, Paul Boven.
--
Paul Boven <address@hidden> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science


