qemu-devel

From: Michael R. Hines
Subject: Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
Date: Sun, 02 Jun 2013 00:09:45 -0400
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130329 Thunderbird/17.0.5

All,

I have successfully performed over 1000 back-to-back RDMA migrations, automatically looped *in a row*, using a heavy-weight memory-stress benchmark here at IBM. Migration success is verified by capturing the actual serial console output of the virtual machine while the benchmark is running and redirecting each migration's console output to a file, then checking that it matches the expected output of a successful migration. For half of the 1000 migrations I used a 14GB virtual machine (the largest VM I can create), and for the remaining 500 I used a 2GB virtual machine (to make sure I was testing both 32-bit and 64-bit address boundaries). The benchmark is configured for 75% stores and 25% loads and uses 80% of the allocatable free memory of the VM (i.e. no swapping allowed).

I have defined a successful migration per the output file as follows:

1. The memory benchmark is still running and active (CPU near 100% and memory usage is high).
2. There are no kernel panics in the console output (regex keywords "panic", "BUG", "oom", etc. - a rough sketch of this check follows the list).
3. The VM is still responding to network activity (pings).
4. The console is still responsive: periodic messages are printed to it throughout the life of the VM, from inside the VM, using the 'write' command in an infinite loop.
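
For concreteness, here is a rough sketch in C of the kind of check items 1 and 2 describe. The file name and keyword list are assumptions, and it uses a plain substring match rather than a real regex - the actual test harness is not included in this mail:

#include <stdio.h>
#include <string.h>

/* Return 1 if the captured console log contains none of the panic keywords. */
int console_log_clean(const char *path)
{
    static const char *bad[] = { "panic", "BUG", "oom" };
    char line[1024];
    FILE *f = fopen(path, "r");

    if (!f) {
        return 0;                       /* a missing log counts as a failure */
    }
    while (fgets(line, sizeof(line), f)) {
        size_t i;
        for (i = 0; i < sizeof(bad) / sizeof(bad[0]); i++) {
            if (strstr(line, bad[i])) {
                fclose(f);
                return 0;               /* keyword found: treat the migration as failed */
            }
        }
    }
    fclose(f);
    return 1;
}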

With this method in a loop, I believe I've ironed out all the regression-testing bugs I can find. You all may find the following bugs interesting. The original version of this patch was written in 2010 (before my time @ IBM).

Bug #1: In the original 2010 patch, every write operation used the same "identifier" (a "Work Request ID" in infiniband terminology). This is not typical (though allowed by the hardware) - instead, each operation should have its own unique identifier so that the write can be tracked properly as it completes.
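
As a hedged illustration (this is not the patch code itself), posting each RDMA write with its own work request identifier via libibverbs might look like the following; the qp, lkey, rkey and addresses are assumed to be set up elsewhere:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static uint64_t next_wr_id;               /* monotonically increasing identifier */

static int post_one_write(struct ibv_qp *qp, void *local, uint32_t len,
                          uint32_t lkey, uint64_t remote, uint32_t rkey)
{
    struct ibv_sge sge = { .addr = (uint64_t)(uintptr_t)local,
                           .length = len, .lkey = lkey };
    struct ibv_send_wr wr, *bad = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id = ++next_wr_id;              /* unique per write, not a shared constant */
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.send_flags = IBV_SEND_SIGNALED;    /* ask for a completion we can track */
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.wr.rdma.remote_addr = remote;
    wr.wr.rdma.rkey = rkey;

    /* The completion entry will carry wr_id, so this write can be matched back. */
    return ibv_post_send(qp, &wr, &bad);
}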

Bug #2: Also in the original 2010 patch, write operations were grouped into separate "signaled" and "unsignaled" work requests, which is also not typical (but allowed by the hardware). "Signaling" is infiniband terminology for whether or not the sender is notified when the RDMA operation has completed. (Note: the receiver is never notified - which is the whole point of a DMA.) In normal operation per the infiniband specifications, "unsignaled" operations (which tell the hardware *not* to notify the sender of completion) are *supposed* to be paired in the same posting with a signaled operation using the *same* work request identifier. Instead, the original patch was using *different* work requests for signaled and unsignaled writes, which means that most of the writes were transmitted without ever being tracked for completion at all. (Per the infiniband specifications, signaled and unsignaled writes must be grouped together because the hardware ensures that the completion notification is not delivered until *all* of the writes of the same request have actually completed.)
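
A rough sketch of that grouping, again using libibverbs and assumed bookkeeping (not the patch code): a chain of unsignaled writes is posted together with one signaled write carrying the same identifier, so a single completion covers the whole group:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define GROUP_MAX 16

struct pending_write { void *local; uint32_t len; uint64_t remote; };

static int post_write_group(struct ibv_qp *qp, struct pending_write *w, int n,
                            uint32_t lkey, uint32_t rkey, uint64_t group_id)
{
    struct ibv_sge sge[GROUP_MAX];
    struct ibv_send_wr wr[GROUP_MAX], *bad = NULL;
    int i;

    if (n <= 0 || n > GROUP_MAX) {
        return -1;                              /* this sketch only handles small groups */
    }
    for (i = 0; i < n; i++) {
        memset(&wr[i], 0, sizeof(wr[i]));
        sge[i].addr = (uint64_t)(uintptr_t)w[i].local;
        sge[i].length = w[i].len;
        sge[i].lkey = lkey;

        wr[i].wr_id = group_id;                 /* the whole group shares one identifier */
        wr[i].opcode = IBV_WR_RDMA_WRITE;
        wr[i].sg_list = &sge[i];
        wr[i].num_sge = 1;
        wr[i].wr.rdma.remote_addr = w[i].remote;
        wr[i].wr.rdma.rkey = rkey;
        wr[i].send_flags = 0;                   /* unsignaled... */
        wr[i].next = (i + 1 < n) ? &wr[i + 1] : NULL;
    }
    wr[n - 1].send_flags = IBV_SEND_SIGNALED;   /* ...except the last write in the chain */

    /* One completion carrying group_id now means every write in the chain is done. */
    return ibv_post_send(qp, wr, &bad);
}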

Bug #3: Finally, in the original 2010 patch, ordering was not being handled. Per the infiniband specifications, writes can complete completely out of order. Not only that, but PCI Express itself can reorder the writes as well. It was only after the first 2 bugs were fixed that I could actually manifest this bug *in code*: a very large group of requests would "burst" from the QEMU migration thread, and not all of those requests would finish. A short time later the next iteration would start, and the virtual machine's writable working set was still "hovering" somewhere in the same vicinity of the address space as the previous burst of writes that had not yet completed. The new writes were much smaller (not part of a larger "chunk" per our algorithms), so they would complete faster than the larger, older writes in the same address range. Because they complete out of order, the newer writes would then get clobbered by the older writes - resulting in an inconsistent virtual machine.

To solve this: during each new write, we now do a "search" to see if the address of the next requested write matches or overlaps the address range of any of the previous "outstanding" writes still in transit - and I found several hits. This was easily solved by blocking until the conflicting write has completed before issuing the new write to the hardware.
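
The overlap guard can be sketched roughly as follows (the bookkeeping structure is an assumption for illustration, not the actual patch): before posting, scan the outstanding writes for a destination-range overlap and poll the completion queue until the conflicting write has finished:

#include <infiniband/verbs.h>
#include <stdint.h>

struct outstanding { uint64_t wr_id; uint64_t remote; uint64_t len; int done; };

static int ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

static void wait_for_conflicts(struct ibv_cq *cq, struct outstanding *out, int n,
                               uint64_t new_remote, uint64_t new_len)
{
    int i, j;

    for (i = 0; i < n; i++) {
        if (out[i].done ||
            !ranges_overlap(out[i].remote, out[i].len, new_remote, new_len)) {
            continue;
        }
        /* Block until the overlapping write has completed. */
        while (!out[i].done) {
            struct ibv_wc wc;
            int ne = ibv_poll_cq(cq, 1, &wc);
            if (ne < 0) {
                return;                   /* polling error: give up in this sketch */
            }
            if (ne == 1 && wc.status == IBV_WC_SUCCESS) {
                for (j = 0; j < n; j++) {
                    if (out[j].wr_id == wc.wr_id) {
                        out[j].done = 1;  /* mark whichever write just finished */
                    }
                }
            }
        }
    }
}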

- Michael


On 05/09/2013 06:45 PM, Michael R. Hines wrote:

Some more follow-up questions below to help me debug before I start digging in...

On 05/09/2013 06:20 PM, Chegu Vinod wrote:

Setting aside the mlock() freezes for the moment, let's first fix your crashing
problem on the destination side - that takes priority over the mlock problem.

When the migration "completes", can you provide me with more detailed information
about the state of QEMU on the destination?

Is it responding?
What's on the VNC console?
Is QEMU responding?
Is the network responding?
Was the VM idle? Or running an application?
Can you attach GDB to QEMU after the migration?


/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-name vm1 \
-m 131072 -smp 10,sockets=1,cores=10,threads=1 \
-mem-path /dev/hugepages \

Can you disable hugepages and re-test?

I'll get back to the other mlock() issues later, after we first make sure the migration itself is working...
