[Qemu-devel] [PATCH 8/8] rdma: add documentation

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-devel] [PATCH 8/8] rdma: add documentation

From:	mrhines
Subject:	[Qemu-devel] [PATCH 8/8] rdma: add documentation
Date:	Fri, 12 Apr 2013 01:52:09 -0400

From: "Michael R. Hines" <address@hidden>

docs/rdma.txt contains full documentation,
wiki links, github url and contact information.

Signed-off-by: Michael R. Hines <address@hidden>
---
 docs/rdma.txt |  338 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 338 insertions(+)
 create mode 100644 docs/rdma.txt

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..a122138
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,338 @@
+(RDMA: Remote Direct Memory Access)
+RDMA Live Migration Specification, Version # 1
+==============================================
+Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
+Github: address@hidden:hinesmr/qemu.git, 'rdma' branch
+
+Copyright (C) 2013 Michael R. Hines <address@hidden>
+
+An *exhaustive* paper (2010) shows additional performance details
+linked on the QEMU wiki above.
+
+Contents:
+=========
+* Running
+* RDMA Protocol Description
+* Versioning and Capabilities
+* QEMUFileRDMA Interface
+* Migration of pc.ram
+* Error handling
+* TODO
+* Performance
+
+RUNNING:
+========
+
+First, decide if you want dynamic page registration on the server-side.
+The only reason to change this setting is if you have a super-fast RDMA
+link which could be fully utlized during the bulk-phase round of migration.
+NOTE: Disabling this will pin all of the memory on the destination, so
+if that is not what you want, then don't do it!
+
+QEMU Monitor Command:
+$ migrate_set_capability chunk_register_destination off # enabled by default
+
+Next, set the migration speed to match your hardware's capabilities:
+
+QEMU Monitor Command:
+$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
+
+Next, on the destination machine, add the following to the QEMU command line:
+
+qemu ..... -incoming rdma:host:port
+
+Finally, perform the actual migration:
+
+QEMU Monitor Command:
+$ migrate -d rdma:host:port
+
+RDMA Protocol Description:
+==========================
+
+Migration with RDMA is separated into two parts:
+
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+
+"Everything else" is transmitted using a formal
+protocol now, consisting of infiniband SEND messages.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause notifications
+to be posted to the completion queue (CQ) on the
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+(Memory is not released from pinning until the migration
+completes, given that RDMA migrations are very fast.)
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a control transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt):
+2. Both sides post two RQ work requests
+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver accept()
+6. Check versioning and capabilities (described later)
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a
+header portion and a data portion (but together are transmitted
+as a single SEND message).
+
+Header:
+    * Length  (of the data portion, uint32, network byte order)
+    * Type    (what command to perform, uint32, network byte order)
+    * Version (protocol version validated before send/recv occurs, uint32, 
network byte order
+    * Repeat  (Number of commands in data portion, same type only)
+
+The 'Repeat' field is here to support future multiple page registrations
+in a single message without any need to change the protocol itself
+so that the protocol is compatible against multiple versions of QEMU.
+Version #1 requires that all server implementations of the protocol must
+check this field and register all requests found in the array of commands 
located
+in the data portion and return an equal number of results in the response.
+The maximum number of repeats is hard-coded to 4096. This is a conservative
+limit based on the maximum size of a SEND message along with emperical
+observations on the maximum future benefit of simultaneous page registrations.
+
+The 'type' field has 7 different command values:
+    1. Unused
+    2. Ready                     (control-channel is available)
+    3. QEMU File                 (for sending non-live device state)
+    4. RAM Blocks                (used right after connection setup)
+    5. Register request          (dynamic chunk registration)
+    6. Register result           ('rkey' to be used by sender)
+    7. Register finished         (registration for current iteration finished)
+
+A single control message, as hinted above, can contain within the data
+portion an array of many commands of the same type. If there is more than
+one command, then the 'repeat' field will be greater than 1.
+
+After connection setup is completed, we have two protocol-level
+functions, responsible for communicating control-channel commands
+using the above list of values:
+
+Logically:
+
+qemu_rdma_exchange_recv(header, expected command type)
+
+1. We transmit a READY command to let the sender know that
+   we are *ready* to receive some data bytes on the control channel.
+2. Before attempting to receive the expected command, we post another
+   RQ work request to replace the one we just used up.
+3. Block on a CQ event channel and wait for the SEND to arrive.
+4. When the send arrives, librdmacm will unblock us.
+5. Verify that the command-type and version received matches the one we 
expected.
+
+qemu_rdma_exchange_send(header, data, optional response header & data):
+
+1. Block on the CQ event channel waiting for a READY command
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+2. Optionally: if we are expecting a response from the command
+   (that we have no yet transmitted), let's post an RQ
+   work request to receive that data a few moments later.
+3. When the READY arrives, librdmacm will
+   unblock us and we immediately post a RQ work request
+   to replace the one we just used up.
+4. Now, we can actually post the work request to SEND
+   the requested command type of the header we were asked for.
+5. Optionally, if we are expecting a response (as before),
+   we block again and wait for that response using the additional
+   work request we previously posted. (This is used to carry
+   'Register result' commands #6 back to the sender which
+   hold the rkey need to perform RDMA. Note that the virtual address
+   corresponding to this rkey was already exchanged at the beginning
+   of the connection (described below).
+
+All of the remaining command types (not including 'ready')
+described above all use the aformentioned two functions to do the hard work:
+
+1. After connection setup, RAMBlock information is exchanged using
+   this protocol before the actual migration begins. This information includes
+   a description of each RAMBlock on the server side as well as the virtual 
addresses
+   and lengths of each RAMBlock. This is used by the client to determine the
+   start and stop locations of chunks and how to register them dynamically
+   before performing the RDMA operations.
+2. During runtime, once a 'chunk' becomes full of pages ready to
+   be sent with RDMA, the registration commands are used to ask the
+   other side to register the memory for this chunk and respond
+   with the result (rkey) of the registration.
+3. Also, the QEMUFile interfaces also call these functions (described below)
+   when transmitting non-live state, such as devices or to send
+   its own protocol information during the migration process.
+
+Versioning and Capabilities
+===========================
+Current version of the protocol is version #1.
+
+The same version applies to both for protocol traffic and capabilities
+negotiation. (i.e. There is only one version number that is referred to
+by all communication).
+
+librdmacm provides the user with a 'private data' area to be exchanged
+at connection-setup time before any infiniband traffic is generated.
+
+Header:
+    * Version (protocol version validated before send/recv occurs), uint32, 
network byte order
+    * Flags   (bitwise OR of each capability), uint32, network byte order
+
+There is no data portion of this header right now, so there is
+no length field. The maximum size of the 'private data' section
+is only 192 bytes per the Infiniband specification, so it's not
+very useful for data anyway. This structure needs to remain small.
+
+This private data area is a convenient place to check for protocol
+versioning because the user does not need to register memory to
+transmit a few bytes of version information.
+
+This is also a convenient place to negotiate capabilities
+(like dynamic page registration).
+
+If the version is invalid, we throw an error.
+
+If the version is new, we only negotiate the capabilities that the
+requested version is able to perform and ignore the rest.
+
+Currently there is only *one* capability in Version #1: dynamic page 
registration
+
+Finally: Negotiation happens with the Flags field: If the primary-VM
+sets a flag, but the destination does not support this capability, it
+will return a zero-bit for that flag and the primary-VM will understand
+that as not being an available capability and will thus disable that
+capability on the primary-VM side.
+
+QEMUFileRDMA Interface:
+=======================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+
+These two functions are very short and simply use the protocol
+describe above to deliver bytes without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+
+Finally, how do we handoff the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from control-channel's SEND
+messages in memory.
+
+Each time we receive a complete "QEMU File" control-channel
+message, the bytes from SEND are copied into a small local holding area.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the holding area until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above and issue another "QEMU File" protocol command,
+asking for a new SEND message to re-fill the buffer.
+
+Migration of pc.ram:
+====================
+
+At the beginning of the migration, (migration-rdma.c),
+the sender and the receiver populate the list of RAMBlocks
+to be registered with each other into a structure.
+Then, using the aforementioned protocol, they exchange a
+description of these blocks with each other, to be used later
+during the iteration of main memory. This description includes
+a list of all the RAMBlocks, their offsets and lengths, virtual
+addresses and possibly includes pre-registered RDMA keys in case dynamic
+page registration was disabled on the server-side, otherwise not.
+
+Main memory is not migrated with the aforementioned protocol,
+but is instead migrated with normal RDMA Write operations.
+
+Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
+Chunk size is not dynamic, but it could be in a future implementation.
+There's nothing to indicate that this is useful right now.
+
+When a chunk is full (or a flush() occurs), the memory backed by
+the chunk is registered with librdmacm is pinned in memory on
+both sides using the aforementioned protocol.
+After pinning, an RDMA Write is generated and transmitted
+for the entire chunk.
+
+Chunks are also transmitted in batches: This means that we
+do not request that the hardware signal the completion queue
+for the completion of *every* chunk. The current batch size
+is about 64 chunks (corresponding to 64 MB of memory).
+Only the last chunk in a batch must be signaled.
+This helps keep everything as asynchronous as possible
+and helps keep the hardware busy performing RDMA operations.
+
+Error-handling:
+===============
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode in which
+we use for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely and
+cleanup all the RDMA descriptors and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way that would happen if the TCP
+socket is broken during a non-RDMA based migration.
+
+TODO:
+=====
+1. Chunk server registration could be improved:
+   This can be done by holding chunks for a certain amount
+   of time and then register all of the chunks at the same
+   time using a fewer number of control messages. The
+   performance of this approach is unclear.
+   Current version of the protocol does not need to change
+   to support such an optimization.
+2. Currently, cgroups swap limits for *both* TCP and RDMA
+   on the sender-side is broken. This is more poignant for
+   RDMA because RDMA requires memory registration.
+   Fixing this requires infiniband page registrations to be
+   zero-page aware, and this does not yet work properly.
+4. Use of the recent /proc/<pid>/pagemap would likely solve some
+   of these problems.
+5. Also, some form of balloon-device usage tracking would also
+   help alleviate some of these issues.
+
+PERFORMANCE
+===========
+
+Using a 40gbps infinband link performing a worst-case stress test:
+
+RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+Approximately 26 gpbs
+1. Average worst-case throughput
+TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+2. Approximately 8 gpbs (using IPOIB IP over Infiniband)
+3. Using chunked registration: approximately 6 gbps.
+
+Average downtime (stop time) ranges between 15 and 33 milliseconds.
-- 
1.7.10.4

[Prev in Thread]

Current Thread

[Next in Thread]

[Qemu-devel] [PATCH 2/8] rdma: core rdma logic, mrhines, 2013/04/12
- [Qemu-devel] [PATCH 3/8] rdma: new QEMUFileOps hooks, mrhines, 2013/04/12
  - Re: [Qemu-devel] [PATCH 3/8] rdma: new QEMUFileOps hooks, Paolo Bonzini, 2013/04/12
    - Re: [Qemu-devel] [PATCH 3/8] rdma: new QEMUFileOps hooks, Michael R. Hines, 2013/04/12
- [Qemu-devel] [PATCH 4/8] rdma: implement new QEMUFileOps hooks, mrhines, 2013/04/12
  - Re: [Qemu-devel] [PATCH 4/8] rdma: implement new QEMUFileOps hooks, Paolo Bonzini, 2013/04/12
    - Re: [Qemu-devel] [PATCH 4/8] rdma: implement new QEMUFileOps hooks, Michael R. Hines, 2013/04/12
- [Qemu-devel] [PATCH 5/8] rdma: introduce capability for chunk registration, mrhines, 2013/04/12
- [Qemu-devel] [PATCH 6/8] rdma: send pc.ram, mrhines, 2013/04/12
- [Qemu-devel] [PATCH 7/8] rdma: print out throughput while debugging, mrhines, 2013/04/12
- [Qemu-devel] [PATCH 8/8] rdma: add documentation, mrhines <=
- Re: [Qemu-devel] [PATCH 2/8] rdma: core rdma logic, Paolo Bonzini, 2013/04/12
  - Re: [Qemu-devel] [PATCH 2/8] rdma: core rdma logic, Michael R. Hines, 2013/04/12
    - Re: [Qemu-devel] [PATCH 2/8] rdma: core rdma logic, Paolo Bonzini, 2013/04/12
    - Re: [Qemu-devel] [PATCH 2/8] rdma: core rdma logic, Michael R. Hines, 2013/04/12

Prev by Date: [Qemu-devel] [PATCH 7/8] rdma: print out throughput while debugging
Next by Date: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
Previous by thread: [Qemu-devel] [PATCH 7/8] rdma: print out throughput while debugging
Next by thread: Re: [Qemu-devel] [PATCH 2/8] rdma: core rdma logic
Index(es):
- Date
- Thread