qemu-devel

Re: [Qemu-devel] KVM/Qemu migration under fast network stress


From: Andrey Korolyov
Subject: Re: [Qemu-devel] KVM/Qemu migration under fast network stress
Date: Wed, 25 Nov 2015 02:32:09 +0300

On Wed, Nov 25, 2015 at 1:44 AM, Or Gerlitz <address@hidden> wrote:
> Hi,
>
> When doing live migration with iperf running on the migrated VM over a
> PV/virtio interface, the migration never completes when the number of
> iperf threads is > 1 :( nor when we run a high-throughput IO (fio)
> benchmark. None of these tests run over the network the migration is
> carried out on.
>
> We get this "error: operation failed: migration job: unexpectedly failed"
>
> The networking/IO benchmarks that execute at the time of migration run
> over PV/virtio --> vhost --> a fast Mellanox Ethernet NIC.
>
> In our env, the network BW is ~20Gb/s with two threads and ~10Gb/s with
> one; IO BW can be around 600MB/s.
>
> Using the verbose option of virsh, we see the migration status go up to
> 99%, stay there for a couple of seconds, and then the migration fails.
>
> If we use the --unsafe option, the migration does complete, going from
> 99% to 100% at some point.
>
> I wonder if there is something we're doing wrong or could optimize...
> we're using upstream kernel 4.3 on both hosts and the guest; the host OS
> is RHEL 7.0 based, e.g. with qemu-kvm-1.5.3-60.el7.x86_64 -- does this
> ring a bell? Would it help to upgrade qemu?
>
> Is there an option to make the live migration process give up on looping
> over dirty pages and stop the vCPU on the migrate-out host, so that the
> migration actually completes?
>
> We're attaching debug messages from virsh for the time of the migration;
> you can see there that it sits at 99% for a couple of seconds and then
> gives up.
>
> Or.
>
>
>
> 2015-11-24 14:13:47.190+0000: 34957: debug :
> virKeepAliveCheckMessage:374 : ka=0x55d161e4e230,
> client=0x55d161e4df40, msg=0x55d161e4dfa8
> 2015-11-24 14:13:47.190+0000: 34957: debug : virNetMessageClear:55 :
> msg=0x55d161e4dfa8 nfds=0
> 2015-11-24 14:13:47.190+0000: 34957: debug :
> virNetClientIOEventLoopRemoveDone:1376 : Waking up sleep
> 0x55d161e4e520
> 2015-11-24 14:13:47.190+0000: 34955: debug : virNetClientIO:1750 :
> Woken up from sleep head=0x7f3644001d40 call=0x55d161e4e520
> 2015-11-24 14:13:47.190+0000: 34955: debug : virNetClientIO:1803 : All
> done with our call head=0x7f3644001d40 call=0x55d161e4e520 rv=0
> 2015-11-24 14:13:47.190+0000: 34955: debug : virNetMessageFree:72 :
> msg=0x55d161e4e7d0 nfds=0 cb=(nil)
> Migration: [ 99 %]2015-11-24 14:13:47.574+0000: 34956: debug :
> virEventPollRunOnce:638 : Poll got 1 event(s)
>
> The above is the last time we see the "Migration: [ 99 %]" running progress 
> bar
>
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virEventPollDispatchTimeouts:423 : Dispatch 0
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virEventPollDispatchHandles:468 : Dispatch 2
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virEventPollDispatchHandles:482 : i=0 w=1
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virEventPollDispatchHandles:482 : i=2 w=3
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virEventPollDispatchHandles:496 : EVENT_POLL_DISPATCH_HANDLE: watch=3
> events=1
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virNetClientIncomingEvent:1816 : client=0x7f36440019a0 wantclose=0
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virNetClientIncomingEvent:1824 : Event fired 0x7f3644001760 1
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virNetMessageDecodeLength:149 : Got length, now need 28 total (24
> more)
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virNetClientCallDispatch:1123 : RPC_CLIENT_MSG_RX:
> client=0x7f36440019a0 len=28 prog=1801807216 vers=1 proc=1 type=2
> status=0 serial=0
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virKeepAliveCheckMessage:374 : ka=0x7f36440015f0,
> client=0x7f36440019a0, msg=0x7f3644001a08
> 2015-11-24 14:13:47.574+0000: 34956: debug :
> virKeepAliveCheckMessage:391 : RPC_KEEPALIVE_RECEIVED:
> ka=0x7f36440015f0 client=0x7f36440019a0 prog=1801807216 vers=1 proc=1
>
>
> and later we see "error: operation failed: migration job: unexpectedly failed"
>
>
> 2015-11-24 14:13:47.766+0000: 34957: debug :
> virNetClientCloseInternal:687 : client=0x7f36440019a0 wantclose=0
> 2015-11-24 14:13:47.766+0000: 34956: debug : virFileClose:90 : Closed fd 16
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f36440019a0
> 2015-11-24 14:13:47.767+0000: 34956: debug : virProcessAbort:104 :
> aborting child process 34958
> 2015-11-24 14:13:47.767+0000: 34956: debug : virProcessAbort:109 :
> process has ended: exit status 1
> 2015-11-24 14:13:47.767+0000: 34956: debug :
> virEventRunDefaultImpl:270 : running default event implementation
> 2015-11-24 14:13:47.767+0000: 34956: debug :
> virEventPollCleanupTimeouts:514 : Cleanup 0
> 2015-11-24 14:13:47.767+0000: 34956: debug :
> virEventPollCleanupTimeouts:550 : Found 0 out of 0 timeout slots used,
> releasing 0
> 2015-11-24 14:13:47.767+0000: 34956: debug :
> virEventPollCleanupHandles:562 : Cleanup 2
> 2015-11-24 14:13:47.767+0000: 34956: debug :
> virEventPollMakePollFDs:391 : Prepare n=0 w=1, f=7 e=1 d=0
> 2015-11-24 14:13:47.767+0000: 34956: debug :
> virEventPollMakePollFDs:391 : Prepare n=1 w=2, f=9 e=1 d=0
> 2015-11-24 14:13:47.767+0000: 34956: debug :
> virEventPollCalculateTimeout:332 : Calculate expiry of 0 timers
> 2015-11-24 14:13:47.767+0000: 34956: debug :
> virEventPollCalculateTimeout:361 : Timeout at 0 due in -1 ms
> 2015-11-24 14:13:47.767+0000: 34956: debug : virEventPollRunOnce:627 :
> EVENT_POLL_RUN: nhandles=2 timeout=-1
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:258 :
> OBJECT_DISPOSE: obj=0x7f36440019a0
> 2015-11-24 14:13:47.767+0000: 34957: debug : virNetClientDispose:599 :
> RPC_CLIENT_DISPOSE: client=0x7f36440019a0
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f3644010c80
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f3644001da0
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f3644001510
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f3644001bc0
> 2015-11-24 14:13:47.767+0000: 34957: debug : virFileClose:90 : Closed fd 17
> 2015-11-24 14:13:47.767+0000: 34957: debug : virFileClose:90 : Closed fd 15
> 2015-11-24 14:13:47.767+0000: 34957: debug : virNetMessageClear:55 :
> msg=0x7f3644001a08 nfds=0
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f3644001da0
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:258 :
> OBJECT_DISPOSE: obj=0x7f3644001da0
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f3644001510
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:258 :
> OBJECT_DISPOSE: obj=0x7f3644001510
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f3644001bc0
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:258 :
> OBJECT_DISPOSE: obj=0x7f3644001bc0
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f3644010c80
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:258 :
> OBJECT_DISPOSE: obj=0x7f3644010c80
> 2015-11-24 14:13:47.767+0000: 34957: debug : virDomainFree:2428 :
> dom=0x7f3644000c30, (VM: name=dev-h-vrt-008-006,
> uuid=04e20bfc-89e6-4098-bace-6a78c3e50919)
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x7f3644000c30
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:258 :
> OBJECT_DISPOSE: obj=0x7f3644000c30
> 2015-11-24 14:13:47.767+0000: 34957: debug : virDomainDispose:262 :
> release domain 0x7f3644000c30 dev-h-vrt-008-006
> 04e20bfc-89e6-4098-bace-6a78c3e50919
> 2015-11-24 14:13:47.767+0000: 34957: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x55d161e4d340
> 2015-11-24 14:13:47.767+0000: 34955: debug : virDomainFree:2428 :
> dom=0x55d161e4d2d0, (VM: name=dev-h-vrt-008-006,
> uuid=04e20bfc-89e6-4098-bace-6a78c3e50919)
> 2015-11-24 14:13:47.767+0000: 34955: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x55d161e4d2d0
> 2015-11-24 14:13:47.767+0000: 34955: debug : virObjectUnref:258 :
> OBJECT_DISPOSE: obj=0x55d161e4d2d0
> 2015-11-24 14:13:47.767+0000: 34955: debug : virDomainDispose:262 :
> release domain 0x55d161e4d2d0 dev-h-vrt-008-006
> 04e20bfc-89e6-4098-bace-6a78c3e50919
> 2015-11-24 14:13:47.767+0000: 34955: debug : virObjectUnref:256 :
> OBJECT_UNREF: obj=0x55d161e4d340
> 2015-11-24 14:13:47.767+0000: 34955: debug : virFileClose:90 : Closed fd 12
> 2015-11-24 14:13:47.767+0000: 34955: debug : virFileClose:90 : Closed fd 13
> error: operation failed: migration job: unexpectedly failed

Have you tried the auto-convergence mechanism present in contemporary
libvirt as well? The problem you are describing was quite common, and the
general solution is to throttle the VM down, either via cgroups (the
oldest approach) or via the emulator's own vCPU throttling mechanism, so
that the migration can complete successfully. Libvirt can also fall back
to migrating the paused VM, which can be considered a "last resort"
within the migration flow.
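For reference, the options above can be sketched with virsh roughly as
follows (flag availability depends on your libvirt version; the domain
name is taken from the log above, and the destination URI is a
placeholder):

```shell
# Ask QEMU to throttle vCPUs when the dirty-page rate outpaces the
# migration link (auto-convergence):
virsh migrate --live --auto-converge dev-h-vrt-008-006 \
    qemu+ssh://dest-host/system

# Or give up on convergence after N seconds and suspend the guest so the
# remaining dirty pages can be copied (this addresses the "stop the vCPU"
# question; the guest resumes on the destination):
virsh migrate --live --timeout 60 dev-h-vrt-008-006 \
    qemu+ssh://dest-host/system

# The cgroups-based alternative: cap the guest's CPU bandwidth for the
# duration of the migration, then restore it afterwards:
virsh schedinfo dev-h-vrt-008-006 --live --set vcpu_quota=20000
```

Note that auto-convergence support landed in QEMU after the 1.5.3 build
mentioned above, so upgrading qemu would likely be needed for that path.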


