Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support


From: Chegu Vinod
Subject: Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
Date: Thu, 06 Jun 2013 16:51:40 -0700
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130509 Thunderbird/17.0.6

On 6/1/2013 9:09 PM, Michael R. Hines wrote:
All,

I have successfully performed over 1000 back-to-back RDMA migrations *in a row*, automatically looped, using a heavy-weight memory-stress benchmark here at IBM. Migration success is verified by capturing the actual serial console output of the virtual machine while the benchmark is running and redirecting each migration's output to a file, to check that the output matches the expected output of a successful migration. For half of the 1000 migrations I used a 14GB virtual machine (the largest VM I can create), and for the remaining 500 migrations I used a 2GB virtual machine (to make sure I was testing both 32-bit and 64-bit address boundaries). The benchmark is configured for 75% stores and 25% loads, using 80% of the allocatable free memory of the VM (i.e. no swapping allowed).

I have defined a successful migration, per the output file, as follows:

1. The memory benchmark is still running and active (CPU near 100% and memory usage is high).
2. There are no kernel panics in the console output (regex keywords "panic", "BUG", "oom", etc...).
3. The VM is still responding to network activity (pings).
4. The console is still responsive: a 'write' command running in an infinite loop inside the VM prints periodic messages to the console throughout the life of the VM.

With this method running in a loop, I believe I've ironed out all the regression-testing bugs that I can find. You all may find the following bugs interesting. The original version of this patch was written in 2010 (before my time @ IBM).

Bug #1: In the original 2010 patch, each write operation used the same "identifier" (a "Work Request ID" in infiniband terminology). This is not typical (though allowed by the hardware); instead, each operation should have its own unique identifier so that the write can be tracked properly as it completes.
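
A minimal libibverbs sketch of that fix - the helper name and counter here are hypothetical illustrations, not the actual patch code:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

static uint64_t next_wr_id = 1;   /* monotonically increasing identifier */

/* Post one RDMA write with its own unique wr_id; the same wr_id comes
 * back in the ibv_wc entry when the write completes, so each operation
 * can be tracked individually. */
static int post_tracked_write(struct ibv_qp *qp, struct ibv_sge *sge,
                              uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr wr = { 0 }, *bad_wr = NULL;

    wr.wr_id               = next_wr_id++;      /* unique per operation */
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}

The completion side would then match the wc.wr_id returned by ibv_poll_cq() against its bookkeeping to retire exactly that write.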

Bug #2: Also in the original 2010 patch, write operations were grouped into separate "signaled" and "unsignaled" work requests, which is also not typical (but allowed by the hardware). "Signaling" is infiniband terminology for enabling or disabling notification to the *sender* when an RDMA operation has completed. (Note: the receiver is never notified - which is what a DMA is supposed to be.) In normal operation per the infiniband specifications, "unsignaled" operations (which tell the hardware *not* to notify the sender of completion) are *supposed* to be paired with a signaled operation using the *same* work request identifier. Instead, the original patch was using *different* work requests for signaled and unsignaled writes, which means that most of the writes would be transmitted without ever being tracked for completion whatsoever. (Per the infiniband specifications, signaled and unsignaled writes must be grouped together because the hardware ensures that completion notification is not given until *all* of the writes of the same request have actually completed.)
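
A sketch of the corrected grouping, on the assumption (per the pairing described above) that a run of unsignaled writes is covered by a final signaled write posted in the same chain - since a queue pair delivers completions in order, that one completion confirms the whole run. This helper is hypothetical, not the patch code:

#include <infiniband/verbs.h>
#include <stddef.h>

/* Post a chain of RDMA writes in one ibv_post_send() call, signaling
 * only the last one. The single completion for that last write implies
 * every earlier write in the chain has also completed. */
static int post_write_chain(struct ibv_qp *qp, struct ibv_send_wr *chain,
                            int n)
{
    struct ibv_send_wr *bad_wr = NULL;
    int i;

    for (i = 0; i < n; i++) {
        chain[i].next       = (i + 1 < n) ? &chain[i + 1] : NULL;
        chain[i].send_flags = (i + 1 < n) ? 0 : IBV_SEND_SIGNALED;
    }
    return ibv_post_send(qp, chain, &bad_wr);
}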

Bug #3: Finally, in the original 2010 patch, ordering was not being handled. Per the infiniband specifications, writes can complete entirely out of order. Not only that, but PCI-express itself can reorder the writes as well. It was only after the first two bugs were fixed that I could actually manifest this bug *in code*. What was happening was that a very large group of requests would "burst" from the QEMU migration thread, and not all of those requests would finish before the next iteration started a short time later. The virtual machine's writable working set was still "hovering" in the same vicinity of the address space as the previous burst of writes that had not yet completed. The new writes were much smaller (not part of a larger "chunk" per our algorithms), so they would complete faster than the larger, older writes covering the same address range. Since they complete out of order, the newer writes would then get clobbered by the older writes - resulting in an inconsistent virtual machine.

So, to solve this: during each new write, we now do a "search" to see whether the address of the next requested write matches or overlaps the address range of any of the previous "outstanding" writes still in transit - and I found several hits. This was easily solved by blocking until the conflicting write has completed before issuing the new write to the hardware.
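
A sketch of that overlap search under assumed bookkeeping - the 'outstanding' table, the helpers, and the busy-poll loop are hypothetical illustrations (and assume every tracked write is signaled), not the actual patch code:

#include <infiniband/verbs.h>
#include <stdint.h>

#define MAX_OUTSTANDING 512

struct pending_write {
    uint64_t wr_id;
    uint64_t addr;
    uint64_t len;
    int      in_flight;
};

static struct pending_write outstanding[MAX_OUTSTANDING];

static int ranges_overlap(uint64_t a, uint64_t alen,
                          uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Block until the write identified by wr_id completes, retiring any
 * other completions we drain along the way. A real implementation
 * would also check wc.status and handle poll errors instead of
 * spinning. */
static void wait_for_wr_id(struct ibv_cq *cq, uint64_t wr_id)
{
    struct ibv_wc wc;
    int i;

    for (;;) {
        if (ibv_poll_cq(cq, 1, &wc) <= 0) {
            continue;                   /* busy-poll for the sketch */
        }
        for (i = 0; i < MAX_OUTSTANDING; i++) {
            if (outstanding[i].in_flight &&
                outstanding[i].wr_id == wc.wr_id) {
                outstanding[i].in_flight = 0;
                break;
            }
        }
        if (wc.wr_id == wr_id) {
            return;                     /* the conflicting write retired */
        }
    }
}

/* Call before posting a new write of [addr, addr+len): block on any
 * outstanding write whose address range overlaps the new one. */
static void resolve_write_conflicts(struct ibv_cq *cq, uint64_t addr,
                                    uint64_t len)
{
    int i;

    for (i = 0; i < MAX_OUTSTANDING; i++) {
        if (outstanding[i].in_flight &&
            ranges_overlap(addr, len,
                           outstanding[i].addr, outstanding[i].len)) {
            wait_for_wr_id(cq, outstanding[i].wr_id);
        }
    }
}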

- Michael


Hi Michael,

Got some limited time on the systems, so I gave your latest bits a quick try today (with the default of no pinning), and it seems to be better than before.

Ran a Java warehouse workload where the guest was 85-90% busy...

For both cases:
(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:<ip>:<port>

...

20VCPU/256G guest

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 106994 milliseconds
downtime: 3795 milliseconds
transferred ram: 15425453 kbytes
throughput: 20418.27 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64707112 pages
skipped: 0 pages
normal: 3839625 pages
normal bytes: 15358500 kbytes

----

40VCPU/512G guest <- I had more warehouse threads with a higher heap size, etc., to keep the guest busy... and hence it seems to have taken a while to converge.

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 2470056 milliseconds
downtime: 6254 milliseconds
transferred ram: 3230142002 kbytes
throughput: 22118.67 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 127436402 pages
skipped: 0 pages
normal: 807307274 pages
normal bytes: 3229229096 kbytes


<..>


