Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support


From: Chegu Vinod
Subject: Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
Date: Thu, 06 Jun 2013 16:51:40 -0700
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130509 Thunderbird/17.0.6

On 6/1/2013 9:09 PM, Michael R. Hines wrote:
All,

I have successfully performed over 1000 back-to-back RDMA migrations *in a row*, automatically looped, using a heavy-weight memory-stress benchmark here at IBM. Migration success is verified by capturing the actual serial console output of the virtual machine while the benchmark is running and redirecting each migration's output to a file, to check that the output matches the expected output of a successful migration. For half of the 1000 migrations I used a 14GB virtual machine (the largest VM I can create), and for the remaining 500 migrations I used a 2GB virtual machine (to make sure I was testing both 32-bit and 64-bit address boundaries). The benchmark is configured for 75% stores and 25% loads, using 80% of the allocatable free memory of the VM (i.e. no swapping allowed).

I have defined a successful migration, per the output file, as follows:

1. The memory benchmark is still running and active (CPU near 100% and memory usage is high).
2. There are no kernel panics in the console output (regex keywords "panic", "BUG", "oom", etc...).
3. The VM is still responding to network activity (pings).
4. The console is still responsive: a 'write' command running in an infinite loop inside the VM prints periodic messages to the console throughout the life of the VM.

With this method running in a loop, I believe I've ironed out all the regression-testing bugs that I can find. You all may find the following bugs interesting. The original version of this patch was written in 2010 (before my time @ IBM).

Bug #1: In the original 2010 patch, each write operation used the same "identifier" (a "Work Request ID" in infiniband terminology). This is not typical (though allowed by the hardware); instead, each operation should have its own unique identifier so that the write can be tracked properly as it completes.
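
A minimal libibverbs sketch of that fix - the helper name and counter here are hypothetical illustrations, not the actual patch code:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

static uint64_t next_wr_id = 1;   /* monotonically increasing identifier */

/* Post one RDMA write with its own unique wr_id; the same wr_id comes
 * back in the ibv_wc entry when the write completes, so each operation
 * can be tracked individually. */
static int post_tracked_write(struct ibv_qp *qp, struct ibv_sge *sge,
                              uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr wr = { 0 }, *bad_wr = NULL;

    wr.wr_id               = next_wr_id++;      /* unique per operation */
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}

The completion side would then match the wc.wr_id returned by ibv_poll_cq() against its bookkeeping to retire exactly that write.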

Bug #2: Also in the original 2010 patch, write operations were grouped into separate "signaled" and "unsignaled" work requests, which is also not typical (but allowed by the hardware). "Signaling" is infiniband terminology for enabling or disabling notification to the *sender* when an RDMA operation has completed. (Note: the receiver is never notified - which is what a DMA is supposed to be.) In normal operation per the infiniband specifications, "unsignaled" operations (which tell the hardware *not* to notify the sender of completion) are *supposed* to be paired with a signaled operation using the *same* work request identifier. Instead, the original patch was using *different* work requests for signaled and unsignaled writes, which means that most of the writes would be transmitted without ever being tracked for completion whatsoever. (Per the infiniband specifications, signaled and unsignaled writes must be grouped together because the hardware ensures that completion notification is not given until *all* of the writes of the same request have actually completed.)
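
A sketch of the corrected grouping, on the assumption (per the pairing described above) that a run of unsignaled writes is covered by a final signaled write posted in the same chain - since a queue pair delivers completions in order, that one completion confirms the whole run. This helper is hypothetical, not the patch code:

#include <infiniband/verbs.h>
#include <stddef.h>

/* Post a chain of RDMA writes in one ibv_post_send() call, signaling
 * only the last one. The single completion for that last write implies
 * every earlier write in the chain has also completed. */
static int post_write_chain(struct ibv_qp *qp, struct ibv_send_wr *chain,
                            int n)
{
    struct ibv_send_wr *bad_wr = NULL;
    int i;

    for (i = 0; i < n; i++) {
        chain[i].next       = (i + 1 < n) ? &chain[i + 1] : NULL;
        chain[i].send_flags = (i + 1 < n) ? 0 : IBV_SEND_SIGNALED;
    }
    return ibv_post_send(qp, chain, &bad_wr);
}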

Bug #3: Finally, in the original 2010 patch, ordering was not being handled. Per the infiniband specifications, writes can complete entirely out of order. Not only that, but PCI-express itself can reorder the writes as well. It was only after the first two bugs were fixed that I could actually manifest this bug *in code*. What was happening was that a very large group of requests would "burst" from the QEMU migration thread, and not all of those requests would finish before the next iteration started a short time later. The virtual machine's writable working set was still "hovering" in the same vicinity of the address space as the previous burst of writes that had not yet completed. The new writes were much smaller (not part of a larger "chunk" per our algorithms), so they would complete faster than the larger, older writes covering the same address range. Since they complete out of order, the newer writes would then get clobbered by the older writes - resulting in an inconsistent virtual machine.

So, to solve this: during each new write, we now do a "search" to see whether the address of the next requested write matches or overlaps the address range of any of the previous "outstanding" writes still in transit - and I found several hits. This was easily solved by blocking until the conflicting write has completed before issuing the new write to the hardware.
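
A sketch of that overlap search under assumed bookkeeping - the 'outstanding' table, the helpers, and the busy-poll loop are hypothetical illustrations (and assume every tracked write is signaled), not the actual patch code:

#include <infiniband/verbs.h>
#include <stdint.h>

#define MAX_OUTSTANDING 512

struct pending_write {
    uint64_t wr_id;
    uint64_t addr;
    uint64_t len;
    int      in_flight;
};

static struct pending_write outstanding[MAX_OUTSTANDING];

static int ranges_overlap(uint64_t a, uint64_t alen,
                          uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Block until the write identified by wr_id completes, retiring any
 * other completions we drain along the way. A real implementation
 * would also check wc.status and handle poll errors instead of
 * spinning. */
static void wait_for_wr_id(struct ibv_cq *cq, uint64_t wr_id)
{
    struct ibv_wc wc;
    int i;

    for (;;) {
        if (ibv_poll_cq(cq, 1, &wc) <= 0) {
            continue;                   /* busy-poll for the sketch */
        }
        for (i = 0; i < MAX_OUTSTANDING; i++) {
            if (outstanding[i].in_flight &&
                outstanding[i].wr_id == wc.wr_id) {
                outstanding[i].in_flight = 0;
                break;
            }
        }
        if (wc.wr_id == wr_id) {
            return;                     /* the conflicting write retired */
        }
    }
}

/* Call before posting a new write of [addr, addr+len): block on any
 * outstanding write whose address range overlaps the new one. */
static void resolve_write_conflicts(struct ibv_cq *cq, uint64_t addr,
                                    uint64_t len)
{
    int i;

    for (i = 0; i < MAX_OUTSTANDING; i++) {
        if (outstanding[i].in_flight &&
            ranges_overlap(addr, len,
                           outstanding[i].addr, outstanding[i].len)) {
            wait_for_wr_id(cq, outstanding[i].wr_id);
        }
    }
}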

- Michael


Hi Michael,

Got some limited time on the systems, so I gave your latest bits a quick try today (with the default of no pinning), and it seems to be better than before.

Ran a Java warehouse workload where the guest was 85-90% busy...

For both cases:
(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:<ip>:<port>

...

20VCPU/256G guest

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 106994 milliseconds
downtime: 3795 milliseconds
transferred ram: 15425453 kbytes
throughput: 20418.27 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64707112 pages
skipped: 0 pages
normal: 3839625 pages
normal bytes: 15358500 kbytes

----

40VCPU/512G guest <- I had more warehouse threads with a higher heap size, etc., to keep the guest busy... and hence it seems to have taken a while to converge.

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 2470056 milliseconds
downtime: 6254 milliseconds
transferred ram: 3230142002 kbytes
throughput: 22118.67 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 127436402 pages
skipped: 0 pages
normal: 807307274 pages
normal bytes: 3229229096 kbytes


<..>


