From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
Date: Mon, 18 Mar 2013 12:40:13 +0200

On Sun, Mar 17, 2013 at 11:18:56PM -0400, address@hidden wrote:
> From: "Michael R. Hines" <address@hidden>
> 
> This tries to cover all the questions I got the last time.
> 
> Please do tell me what is not clear, and I'll revise again.
> 
> Signed-off-by: Michael R. Hines <address@hidden>
> ---
>  docs/rdma.txt |  208 
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 208 insertions(+)
>  create mode 100644 docs/rdma.txt
> 
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..2a48ab0
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,208 @@
> +Changes since v3:
> +
> +- Compile-tested with and without --enable-rdma; both configurations work.
> +- Updated docs/rdma.txt (included below)
> +- Merged with latest pull queue from Paolo
> +- Implemented qemu_ram_foreach_block()
> +
> address@hidden:~/qemu$ git diff --stat master
> +Makefile.objs                 |    1 +
> +arch_init.c                   |   28 +-
> +configure                     |   25 ++
> +docs/rdma.txt                 |  190 +++++++++++
> +exec.c                        |   21 ++
> +include/exec/cpu-common.h     |    6 +
> +include/migration/migration.h |    3 +
> +include/migration/qemu-file.h |   10 +
> +include/migration/rdma.h      |  269 ++++++++++++++++
> +include/qemu/sockets.h        |    1 +
> +migration-rdma.c              |  205 ++++++++++++
> +migration.c                   |   19 +-
> +rdma.c                        | 1511 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +savevm.c                      |  172 +++++++++-
> +util/qemu-sockets.c           |    2 +-
> +15 files changed, 2445 insertions(+), 18 deletions(-)


Above looks strange :)

> +QEMUFileRDMA:

I think there are two things here: API documentation and
protocol documentation. The protocol documentation
still needs some more work. Also, if what I understand
from this document is correct, this breaks memory overcommit
on the destination, which needs to be fixed.


> +==================================
> +
> +QEMUFileRDMA introduces a couple of new functions:
> +
> +1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> +2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> +
> +These two functions provide an RDMA transport
> +(not a protocol) without changing the upper-level
> +users of QEMUFile that depend on a bytestream abstraction.
> +
> +In order to provide the same bytestream interface 
> +for RDMA, we use SEND messages instead of sockets.
> +The operations themselves and the protocol built on 
> +top of QEMUFile used throughout the migration 
> +process do not change whatsoever.
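
For illustration, the wiring into QEMUFile presumably looks something
like the sketch below. The field names, the close handler and the
header path are my assumptions, not taken from the patch:

    #include "migration/qemu-file.h"

    /* Hypothetical sketch: expose the RDMA transport through the same
     * QEMUFileOps hooks the existing socket transports use. */
    static const QEMUFileOps rdma_read_ops = {
        .get_buffer = qemu_rdma_get_buffer, /* fill a buffer from SEND data */
        .close      = qemu_rdma_close,      /* assumed cleanup hook */
    };

    static const QEMUFileOps rdma_write_ops = {
        .put_buffer = qemu_rdma_put_buffer, /* push bytes out via SEND */
        .close      = qemu_rdma_close,
    };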
> +
> +An infiniband SEND message is the standard ibverbs
> +message used by applications of infiniband hardware.
> +The only difference between a SEND message and an RDMA
> +message is that SEND messages cause completion notifications
> +to be posted to the completion queue (CQ) on the 
> +infiniband receiver side, whereas RDMA messages (used
> +for pc.ram) do not (to behave like an actual DMA).
> +    
> +Messages in infiniband require two things:
> +
> +1. registration of the memory that will be transmitted
> +2. (SEND only) work requests to be posted on both
> +   sides of the network before the actual transmission
> +   can occur.
> +
> +RDMA messages are much easier to deal with. Once the memory
> +on the receiver side is registered and pinned, we're
> +basically done. All that is required is for the sender
> +side to start dumping bytes onto the link.
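
For readers without the verbs API in their head, "dumping bytes onto
the link" boils down to registering the chunk and posting an RDMA
write; a minimal sketch (pd, qp, chunk_addr, remote_chunk_addr and
remote_rkey are assumed to be already set up, error handling omitted):

    #include <infiniband/verbs.h>

    /* Register (and thereby pin) the local chunk. */
    struct ibv_mr *mr = ibv_reg_mr(pd, chunk_addr, chunk_len,
                                   IBV_ACCESS_LOCAL_WRITE);

    /* RDMA-write it into the peer's registered region; no completion
     * is generated on the receiver side for IBV_WR_RDMA_WRITE. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t) chunk_addr,
        .length = chunk_len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list             = &sge,
        .num_sge             = 1,
        .opcode              = IBV_WR_RDMA_WRITE,
        .send_flags          = IBV_SEND_SIGNALED, /* sender-side completion only */
        .wr.rdma.remote_addr = remote_chunk_addr,
        .wr.rdma.rkey        = remote_rkey,
    };
    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);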
> +
> +SEND messages require more coordination because the
> +receiver must have reserved space (using a receive
> +work request) on the receive queue (RQ) before QEMUFileRDMA
> +can start using them to carry all the bytes as
> +a transport for migration of device state.
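
Concretely, the "reserved space" is a receive work request posted
against a registered buffer, roughly like this (recv_buf, RECV_BUF_LEN
and the pd/qp handles are assumed names, error handling omitted):

    #include <infiniband/verbs.h>

    /* The receiver must hand the HCA a buffer before the peer's SEND
     * arrives, otherwise the SEND has nowhere to land. */
    struct ibv_mr *mr = ibv_reg_mr(pd, recv_buf, RECV_BUF_LEN,
                                   IBV_ACCESS_LOCAL_WRITE);

    struct ibv_sge sge = {
        .addr   = (uintptr_t) recv_buf,
        .length = RECV_BUF_LEN,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr;
    ibv_post_recv(qp, &wr, &bad_wr);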
> +
> +After the initial connection setup (migration-rdma.c),

Is there any feature and/or version negotiation? How are we going to
handle compatibility when we extend the protocol?

> +this coordination starts by having both sides post
> +a single work request to the RQ before any users
> +of QEMUFile are activated.

So how does destination know it's ok to send anything
to source?
I suspect this is wrong. When using CM you must post
on RQ before completing the connection negotiation,
not after it's done.

> +
> +Once an initial receive work request is posted,
> +we have a put_buffer()/get_buffer() implementation
> +that looks like this:
> +
> +Logically:
> +
> +qemu_rdma_get_buffer():
> +
> +1. A user on top of QEMUFile calls ops->get_buffer(),
> +   which calls us.
> +2. We transmit an empty SEND to let the sender know that 
> +   we are *ready* to receive some bytes from QEMUFileRDMA.
> +   These bytes will come in the form of another SEND.
> +3. Before attempting to receive that SEND, we post another
> +   RQ work request to replace the one we just used up.
> +4. Block on a CQ event channel and wait for the SEND
> +   to arrive.
> +5. When the SEND arrives, librdmacm will unblock us
> +   and we can consume the bytes (described later).
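
Steps 4 and 5 are the standard completion-channel pattern in
libibverbs; roughly (cq and comp_channel are assumed to exist, error
handling omitted):

    /* Arm the CQ, block on the completion channel, then drain the CQ. */
    ibv_req_notify_cq(cq, 0);

    struct ibv_cq *ev_cq;
    void *ev_ctx;
    ibv_get_cq_event(comp_channel, &ev_cq, &ev_ctx);  /* blocks here */
    ibv_ack_cq_events(ev_cq, 1);
    ibv_req_notify_cq(ev_cq, 0);

    struct ibv_wc wc;
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        /* wc.opcode == IBV_WC_RECV: the peer's SEND has landed in the
         * buffer posted in step 3, wc.byte_len bytes of it. */
    }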

Using an empty message seems somewhat hacky; a fixed header in the
message would let you do more things if the protocol is ever extended.
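
For instance, a small fixed header could look like this (purely
illustrative, not part of the patch):

    /* Hypothetical control-message header, sent in network byte order. */
    struct rdma_control_header {
        uint32_t version; /* protocol version, for compatibility checks */
        uint32_t type;    /* e.g. READY, REGISTER, ERROR, ...           */
        uint32_t length;  /* payload bytes following the header         */
    };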

> +qemu_rdma_put_buffer(): 
> +
> +1. A user on top of QEMUFile calls ops->put_buffer(),
> +   which calls us.
> +2. Block on the CQ event channel waiting for a SEND
> +   from the receiver to tell us that the receiver
> +   is *ready* for us to transmit some new bytes.
> +3. When the "ready" SEND arrives, librdmacm will 
> +   unblock us and we immediately post a RQ work request
> +   to replace the one we just used up.
> +4. Now, we can actually deliver the bytes that
> +   put_buffer() wants and return. 
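
Step 4 is then an ordinary SEND of the staged bytes; a sketch
(send_buf, send_len and send_mr are assumed names):

    struct ibv_sge sge = {
        .addr   = (uintptr_t) send_buf, /* bytes handed to put_buffer() */
        .length = send_len,
        .lkey   = send_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,       /* consumes one RQ entry on the peer */
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);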

OK to summarize flow control: at any time there's
either 0 or 1 outstanding buffers in RQ.
At each time only one side can talk.
Destination always goes first, then source, etc.
At each time a single send message can be passed.


Just FYI, this means you are often at 0 buffers in RQ and IIRC 0 buffers
is a worst-case path for infiniband. It's better to keep at least 1
buffer in RQ at all times, so prepost 2 initially so it would fluctuate
between 1 and 2.
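
I.e. something like this at connection setup (illustrative only;
recv_bufs, recv_mrs and RECV_BUF_LEN are assumed names):

    /* Prepost two receives so the RQ never drains to zero in the
     * ping-pong described above. */
    int i;
    for (i = 0; i < 2; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) recv_bufs[i],
            .length = RECV_BUF_LEN,
            .lkey   = recv_mrs[i]->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = i,
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr;
        ibv_post_recv(qp, &wr, &bad_wr);
    }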

> +
> +NOTE: This entire sequence of events is designed this
> +way to mimic the operations of a bytestream and is not
> +typical of an infiniband application. (Something like MPI
> +would not 'ping-pong' messages like this and would not
> +block after every request, which would normally defeat
> +the purpose of using zero-copy infiniband in the first place).
> +
> +Finally, how do we hand off the actual bytes to get_buffer()?
> +
> +Again, because we're trying to "fake" a bytestream abstraction
> +using an analogy not unlike individual UDP frames, we have
> +to hold on to the bytes received from SEND in memory.
> +
> +Each time we get to "Step 5" above for get_buffer(),
> +the bytes from SEND are copied into a local holding buffer.
> +
> +Then, we return the number of bytes requested by get_buffer()
> +and leave the remaining bytes in the buffer until get_buffer()
> +comes around for another pass.
> +
> +If the buffer is empty, then we follow the same steps
> +listed above for qemu_rdma_get_buffer() and block waiting
> +for another SEND message to re-fill the buffer.
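
In code, the holding-buffer logic amounts to something like this
(RDMAContext, qemu_rdma_block_for_send and the recv_* fields are
hypothetical names, not the patch's):

    /* Hand out bytes from the holding buffer, refilling it from the
     * next SEND only when it runs dry. */
    static int qemu_rdma_fill(RDMAContext *rdma, uint8_t *buf, int size)
    {
        if (rdma->recv_len == 0) {
            /* Empty: do the ready-SEND / block-on-CQ dance above. */
            qemu_rdma_block_for_send(rdma);
        }
        int len = size < rdma->recv_len ? size : rdma->recv_len;
        memcpy(buf, rdma->recv_buf + rdma->recv_off, len);
        rdma->recv_off += len;
        rdma->recv_len -= len;
        return len; /* leftover bytes wait for the next get_buffer() */
    }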
> +
> +Migration of pc.ram:
> +===============================
> +
> +At the beginning of the migration (migration-rdma.c),
> +the sender and the receiver each populate a structure with
> +the list of RAMBlocks to be registered with the other side.

Could you add the packet format here as well please?
Need to document endian-ness etc.

> +Then, using a single SEND message, they exchange this
> +structure with each other, to be used later during the
> +iteration of main memory. This structure includes a list
> +of all the RAMBlocks, their offsets and lengths.
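
The wire format of this exchange is not documented here; each entry
presumably carries at least something like the following (hypothetical
layout; the real format and its endianness still need to be written
down):

    /* Hypothetical per-RAMBlock descriptor. */
    struct rdma_ram_block_desc {
        uint64_t offset;     /* block offset in the ram_addr space */
        uint64_t length;     /* block length in bytes              */
        char     idstr[256]; /* RAMBlock name                      */
    };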

This basically means that all memory on the destination has to be
registered upfront. A typical guest has gigabytes of memory; IMHO
that's too much memory to have pinned.

> +
> +Main memory is not migrated with SEND infiniband 
> +messages, but is instead migrated with RDMA infiniband
> +messages.
> +
> +Messages are migrated in "chunks" (about 64 pages right now).
> +Chunk size is not dynamic, but it could be in a future
> +implementation.
> +
> +When a total of 64 pages is aggregated (or a flush() occurs),
> +the memory backed by the chunk on the sender side is
> +registered with librdmacm and pinned in memory.
> +
> +After pinning, an RDMA send is generated and transmitted
> +for the entire chunk.
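
A sketch of the sender-side chunking (assumed names; the patch's
actual bookkeeping will differ):

    #define RDMA_CHUNK_PAGES 64 /* current fixed chunk size */

    /* Accumulate pages; once a chunk is full (or on flush), register
     * and RDMA-write the whole chunk as shown earlier. */
    static void rdma_queue_page(RDMAChunk *chunk, ram_addr_t offset)
    {
        chunk->npages++;
        if (chunk->npages == RDMA_CHUNK_PAGES) {
            rdma_register_chunk(chunk); /* ibv_reg_mr() over the chunk */
            rdma_write_chunk(chunk);    /* IBV_WR_RDMA_WRITE */
            chunk->npages = 0;
        }
    }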

I think something chunk-based on the destination side is required
as well. You also can't trust the source to tell you
the chunk size; it could be malicious and ask for too much.
Maybe the source gives a chunk size hint and the destination responds
with what it wants to use.


> +Error-handling:
> +===============================
> +
> +Infiniband has what is called a "Reliable, Connected"
> +link (one of 4 choices). This is the mode we use
> +for RDMA migration.
> +
> +If a *single* message fails,
> +the decision is to abort the migration entirely and
> +clean up all the RDMA descriptors and unregister all
> +the memory.
> +
> +After cleanup, the Virtual Machine is returned to normal
> +operation, the same way it would be if the TCP socket
> +broke during a non-RDMA-based migration.

Yes but we also need to report errors detected during migration.
Need to document how this is done.
We also need to report success.

> +
> +USAGE
> +===============================
> +
> +Compiling:
> +
> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
> +
> +$ make
> +
> +Command-line on the Source machine AND Destination:
> +
> +$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or 
> whatever is the MAX of your RDMA device
> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> +
> +PERFORMANCE
> +===================
> +
> +Using a 40gbps infiniband link, performing a worst-case stress test:
> +
> +Average worst-case throughput:
> +
> +1. RDMA, with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   approximately 30 gbps (a little better than the paper)
> +2. TCP, with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   approximately 8 gbps (using IPoIB, IP over Infiniband)
> +
> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
> +
> +An *exhaustive* paper (2010) with additional performance details
> +is linked on the QEMU wiki:
> +
> +http://wiki.qemu.org/Features/RDMALiveMigration
> -- 
> 1.7.10.4


