
[Qemu-devel] Live migration debugging


From: Paul Boven
Subject: [Qemu-devel] Live migration debugging
Date: Tue, 29 Jul 2014 13:31:46 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0

Hi folks,

Recently there have been several patches to fix kvmclock issues during migration, which were subsequently reverted. I hope the observations below can be helpful in pinning down the actual issues, so that live migration works again in the future.

Live migration has been broken since at least release 1.4.0 (as shipped with Ubuntu 13.04), and still has the same problems in 2.1.0-rc2, but briefly worked in 2.0-git-20140609.

The problem is that once the live migration is complete and the guest is started on the destination server, it hangs, consuming 100% CPU. The hang can last mere seconds, but I have also observed hangs as long as 11 minutes. Then the guest suddenly starts to respond again as if nothing had happened, but its clock has not progressed at all while the machine was hanging.

What I have observed is that the time spent hanging is exactly the amount by which the host's clock has drifted from 'real' (NTP) time since the previous migration. If you multiply the time since the previous migration by the PPM offset as determined by NTP (see /var/lib/ntp/ntp.drift), you get exactly the number of seconds the guest spends at 100% CPU before becoming responsive again. I have observed this on two different pairs of KVM servers; each of the servers has a negative PPM value according to NTP.
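Put as a formula:

  freeze (s) ≈ (time since previous migration, in s) × |PPM offset| × 1e-6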

Example: a guest with nearly 9 days of uptime, on a host with (according to NTP) a clock rate of -34 ppm, froze for 27 seconds when I migrated it. I have done quite a few test migrations, and this relationship holds quite precisely.
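Checking the numbers: 9 days ≈ 777,600 s, and 777,600 s × 34 × 1e-6 ≈ 26.4 s, which matches the observed 27-second freeze.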

As the duration of the freeze is proportional to the time since the previous migration, debugging is a bit difficult: you have to wait a while before you can demonstrate the problem. This is probably also why the problem is underreported: the freeze is hardly noticeable if you migrate right after starting the VM, but looks like a complete crash after a few months of uptime.

With the 2.0 sources from 2014-06-09, the problem does *not* occur. A side-effect of the patch in that snapshot is that the guest clock has a lot of jitter until the first migration, but it behaves normally, without hangs, on subsequent migrations.

Is there a way to read the kvmclock directly, from the guest or the host, so that we can compare the two before and after a migration and see precisely what goes wrong?
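In the meantime, here is a minimal user-space sketch (my own rough approach, assuming a Linux guest that uses kvm-clock as its current clocksource; check /sys/devices/system/clocksource/clocksource0/current_clocksource) that at least makes the freeze, and any wall-clock step, visible from inside the guest. Both clocks read the kvm-clock clocksource in this setup, so a kvmclock stall shows up as a gap between successive lines, while a step in the wall clock shows up as a change in the printed offset:

/* clockwatch.c - print CLOCK_REALTIME vs. CLOCK_MONOTONIC_RAW once a
 * second. A migration freeze shows up as a gap between successive
 * lines; a wall-clock step shows up as a jump in the printed offset.
 * Build: gcc -o clockwatch clockwatch.c   (add -lrt on older glibc)
 */
#define _GNU_SOURCE             /* for CLOCK_MONOTONIC_RAW */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double ts_to_sec(const struct timespec *ts)
{
    return ts->tv_sec + ts->tv_nsec / 1e9;
}

int main(void)
{
    for (;;) {
        struct timespec rt, raw;

        clock_gettime(CLOCK_REALTIME, &rt);
        clock_gettime(CLOCK_MONOTONIC_RAW, &raw);
        printf("real=%.6f raw=%.6f offset=%.6f\n",
               ts_to_sec(&rt), ts_to_sec(&raw),
               ts_to_sec(&rt) - ts_to_sec(&raw));
        fflush(stdout);
        sleep(1);
    }
}

Leave it running across a migration and compare the log around that moment: a hole in the timestamps is the freeze, a changed offset is a wall-clock step.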

See also https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1297218

Regards, Paul Boven.
--
Paul Boven <address@hidden> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science


