From: Michael R. Hines
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
Date: Wed, 10 Apr 2013 09:04:44 -0400
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130106 Thunderbird/17.0.2

On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
Below is a great high-level overview. The protocol looks correct.
A bit more detail would be helpful, as noted below.

The main thing I'd like to see changed is that there are already
two protocols here: chunk-based and non chunk based.
We'll need to use versioning and capabilities going forward but in the
first version we don't need to maintain compatibility with legacy so
two versions seems like unnecessary pain.  Chunk based is somewhat slower and
that is worth fixing longer term, but seems like the way forward. So
let's implement a single chunk-based protocol in the first version we

Some more minor improvement suggestions below.

However, IMHO restricting the policy to use only chunk-based registration is really
not an acceptable choice:

Here's the reason: Using my 10gbps RDMA hardware, throughput takes a dive from 10gbps to 6gbps.

But if I disable chunk-based registration altogether (forgoing overcommit), then performance comes back.

The reason for this is the additional control channel traffic needed to ask the server to register memory pages on demand. Without this traffic, we can easily saturate the link.

Because of this traffic, the user needs to know about the cost and be given the option to disable the feature
in case they want performance instead of flexibility.

On Mon, Apr 08, 2013 at 11:04:32PM -0400, address@hidden wrote:
From: "Michael R. Hines" <address@hidden>

Both the protocol and interfaces are elaborated in more detail,
including the new use of dynamic chunk registration, versioning,
and capabilities negotiation.

Signed-off-by: Michael R. Hines <address@hidden>
  docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  1 file changed, 313 insertions(+)
  create mode 100644 docs/rdma.txt

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..e9fa4cd
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,313 @@
+Several changes since v4:
+- Created a "formal" protocol for the RDMA control channel
+- Dynamic, chunked page registration now implemented on *both* the server and 
+- Created new 'capability' for page registration
+- Created new 'capability' for is_zero_page() - enabled by default
+  (needed to test dynamic page registration)
+- Created version-check before protocol begins at connection-time
+- no more migrate_use_rdma() !
+NOTE: While dynamic registration works on both sides now,
+      it does *not* work with cgroups swap limits. This functionality with 
+      remains broken. (It works fine with TCP). So, in order to take full
+      advantage of this feature, a fix will have to be developed on the kernel 
+      The proposed alternative is to use /proc/<pid>/pagemap. A patch will be submitted.
You mean the idea of using pagemap to detect shared pages created by KSM
and/or zero pages? That would be helpful for TCP migration, thanks!

Yes, absolutely. This would *also* help the above registration problem.

We could use this to *pre-register* pages in advance, but that would be
an entirely different patch series (which I'm willing to write and submit).

BTW the above comments belong outside both document and commit log,
after --- before diff.

+* Compiling
+* Running (please read before running)
+* RDMA Protocol Description
+* Versioning
+* QEMUFileRDMA Interface
+* Migration of pc.ram
+* Error handling
+* Performance
+$ ./configure --enable-rdma --target-list=x86_64-softmmu
+$ make
+First, decide if you want dynamic page registration on the server-side.
+This always happens on the primary-VM side, but is optional on the server.
+Doing this allows you to support overcommit (such as cgroups or ballooning)
+with a smaller footprint on the server-side without having to register the
+entire VM memory footprint.
+NOTE: This significantly slows down performance (about 30% slower).
Where does the overhead come from? It appears from the description that
you have exactly same amount of data to exchange using send messages,
either way?
Or are you using bigger chunks with upfront registration?

Answer is above.

Upfront registration registers the entire VM before migration starts,
whereas dynamic registration (on both sides) registers chunks in
1 MB increments as they are requested by the migration_thread.

The extra SEND messages required to ask the server to register
the memory mean that each RDMA write must block until those
messages complete before it can begin.
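
To make that cost concrete, here is a rough sketch (Python, purely
illustrative, not QEMU code) that counts the control messages dynamic
registration adds. The 1 MB chunk size comes from this thread; the function
names, and the accounting of one request plus one result per chunk, are
assumptions made for illustration. The Ready credit messages are not counted.

```python
CHUNK_SIZE = 1 << 20  # 1 MB, the hard-coded chunk size mentioned in this thread


def chunk_index(block_offset):
    """Index of the 1 MB chunk containing a byte offset within a RAMBlock."""
    return block_offset // CHUNK_SIZE


def chunks_in_block(block_length):
    """Number of chunks needed to cover a RAMBlock (the last may be partial)."""
    return (block_length + CHUNK_SIZE - 1) // CHUNK_SIZE


def registration_messages(block_length, dynamic):
    """Hypothetical count of registration messages for one pass over a block.

    Assumes one 'Register request' and one 'Register result' per chunk when
    dynamic registration is on, and none when memory is registered upfront.
    """
    return 2 * chunks_in_block(block_length) if dynamic else 0
```

Under these assumptions a 4 GB block costs 8192 extra control messages per
pass with dynamic registration and none with upfront registration.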

+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_capability chunk_register_destination on" # disabled by 
I think the right choice is to make chunk based the default, and remove
the non chunk based from code.  This will simplify the protocol a tiny bit,
and make us focus on improving chunk based long term so that it's as
fast as upfront registration.
Answer above.

+Next, if you decided *not* to use chunked registration on the server,
+it is recommended to also disable zero page detection. While this is not
+strictly necessary, zero page detection also significantly slows down
+performance on higher-throughput links (by about 50%), like 40 gbps infiniband 
What is meant by performance here? downtime?

Throughput. Zero page scanning (and dynamic registration) reduces throughput significantly.

+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_capability check_for_zero off" # always enabled by 
+Finally, set the migration speed to match your hardware's capabilities:
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
+Finally, perform the actual migration:
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+RDMA Protocol Description:
+Migration with RDMA is separated into two parts:
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+"Everything else" is transmitted using a formal
+protocol now, consisting of infiniband SEND / RECV messages.
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause completion notifications
+to be posted to the completion queue (CQ) on the
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+Messages in infiniband require two things:
+1. registration of the memory that will be transmitted
+2. (SEND/RECV only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
When is memory unregistered and unpinned on send and receive sides?
Only when the migration ends completely. Will update the documentation.

+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+1. Receiver and Sender are started (command line or libvirt):
+2. Both sides post two RQ work requests
Okay this could be where the problem is. This means with chunk
based receive side does:

        receive request
        send response

while with non chunk based it does:

receive request
send response
No, that's incorrect. With "non" chunk based, the receive side does *not* communicate
during the migration of pc.ram.

The control channel is only used for chunk registration and device state, not RAM.

I will update the documentation to make that more clear.

In reality each request/response requires two network round-trips
with the Ready credit-management messages.
So the overhead will likely be avoided if we add better pipelining:
allow multiple registration requests in the air, and add more
send/receive credits so the overhead of credit management can be
Unfortunately, the migration thread doesn't work that way.
The thread only generates one page write at-a-time.

If someone were to write a patch which submits multiple
writes at the same time, I would be very interested in
consuming that feature and making chunk registration more
efficient by batching multiple registrations into fewer messages.

There's no requirement to implement these optimizations upfront
before merging the first version, but let's remove the
non-chunkbased crutch unless we see it as absolutely necessary.

+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver accept()
+6. Check versioning and capabilities (described later)
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a
+header portion and a data portion (but together are transmitted
+as a single SEND message).
+    * Length  (of the data portion)
+    * Type    (what command to perform, described below)
+    * Version (protocol version validated before send/recv occurs)
What's the expected value for Version field?
Also, confusing.  Below mentions using private field in librdmacm instead?
Need to add # of bytes and endian-ness of each field.

Correct, those are two separate versions. One for capability negotiation
and one for the protocol itself.

I will update the documentation.
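
Until the field sizes are documented, the header can be pictured with a
sketch like the following. It is only an assumption for illustration: the
thread does not state the width or endianness of any field, so three unsigned
32-bit fields in network byte order are assumed here, and the helper names
are hypothetical.

```python
import struct

# Assumed layout (NOT specified in this thread): Length, Type, Version,
# each an unsigned 32-bit integer in network (big-endian) byte order.
HEADER_FMT = "!III"
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 12 bytes under this assumption


def pack_message(msg_type, version, data=b""):
    """Build one SEND payload: the header followed by the data portion."""
    return struct.pack(HEADER_FMT, len(data), msg_type, version) + data


def unpack_message(buf):
    """Split a received SEND payload back into (type, version, data)."""
    length, msg_type, version = struct.unpack_from(HEADER_FMT, buf)
    data = buf[HEADER_LEN:HEADER_LEN + length]
    if len(data) != length:
        raise ValueError("truncated data portion")
    return msg_type, version, data
```

Header and data travel together as a single SEND message, which is why the
Length field is needed to find the end of the data portion.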

+The 'type' field has 7 different command values:
0. Unused.

+    1. None
you mean this is unused?

Correct - will update.

+    2. Ready             (control-channel is available)
+    3. QEMU File         (for sending non-live device state)
+    4. RAM Blocks        (used right after connection setup)
+    5. Register request  (dynamic chunk registration)
+    6. Register result   ('rkey' to be used by sender)
Hmm, don't you also need a virtual address for RDMA writes?

The virtual addresses are communicated at the beginning of the
migration using command #4 "Ram blocks".

+    7. Register finished (registration for current iteration finished)
What does Register finished mean and how it's used?

Need to add which commands have a data portion, and in what format.

Acknowledged. "finished" signals that a migration round has completed
and that the receiver side can move to the next iteration.
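
For reference, the command numbering quoted above can be written down as an
enumeration. This is a sketch, not the patch's code: the Python names are
hypothetical, and the response table reflects only what is said elsewhere in
this thread (that a register request is the only command expecting a result).

```python
from enum import IntEnum


class ControlType(IntEnum):
    """Command values as numbered in the quoted documentation."""
    NONE = 1               # unused
    READY = 2              # control channel is available
    QEMU_FILE = 3          # non-live device state
    RAM_BLOCKS = 4         # sent right after connection setup
    REGISTER_REQUEST = 5   # dynamic chunk registration
    REGISTER_RESULT = 6    # carries the rkey back to the sender
    REGISTER_FINISHED = 7  # registration for the current iteration is done


# Only a register request expects a reply (assumption taken from the thread).
EXPECTED_RESPONSE = {ControlType.REGISTER_REQUEST: ControlType.REGISTER_RESULT}


def response_for(cmd):
    """Return the reply type a command waits for, or None if it expects none."""
    return EXPECTED_RESPONSE.get(cmd)
```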

+After connection setup is completed, we have two protocol-level
+functions, responsible for communicating control-channel commands
+using the above list of values:
+qemu_rdma_exchange_recv(header, expected command type)
+1. We transmit a READY command to let the sender know that
you call it Ready above, so better be consistent.

+   we are *ready* to receive some data bytes on the control channel.
+2. Before attempting to receive the expected command, we post another
+   RQ work request to replace the one we just used up.
+3. Block on a CQ event channel and wait for the SEND to arrive.
+4. When the send arrives, librdmacm will unblock us.
+5. Verify that the command-type and version received matches the one we 
+qemu_rdma_exchange_send(header, data, optional response header & data):
+1. Block on the CQ event channel waiting for a READY command
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+2. Optionally: if we are expecting a response from the command
+   (that we have not yet transmitted),
Which commands expect result? Only Register request?

Yes, only register. In the code, the command is #define RDMA_CONTROL_REGISTER_RESULT

let's post an RQ
+   work request to receive that data a few moments later.
+3. When the READY arrives, librdmacm will
+   unblock us and we immediately post a RQ work request
+   to replace the one we just used up.
+4. Now, we can actually post the work request to SEND
+   the requested command type of the header we were asked for.
+5. Optionally, if we are expecting a response (as before),
+   we block again and wait for that response using the additional
+   work request we previously posted. (This is used to carry
+   'Register result' commands (#6) back to the sender, which
+   hold the rkey needed to perform RDMA.)
+All of the remaining command types (not including 'ready')
+described above use the aforementioned two functions to do the hard work:
+1. After connection setup, RAMBlock information is exchanged using
+   this protocol before the actual migration begins.
+2. During runtime, once a 'chunk' becomes full of pages ready to
+   be sent with RDMA, the registration commands are used to ask the
+   other side to register the memory for this chunk and respond
+   with the result (rkey) of the registration.
+3. The QEMUFile interfaces also call these functions (described below)
+   when transmitting non-live state, such as devices or to send
+   its own protocol information during the migration process.
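
The Ready handshake behind the two functions can be modelled with a toy
simulation. This is a sketch under loud assumptions: Python queues stand in
for the receive queue and completion channel, posting of RQ work requests is
omitted, and the names (Endpoint, exchange_send, exchange_recv, demo) are
hypothetical rather than the patch's actual functions.

```python
import queue
import threading

READY, REGISTER_REQUEST, REGISTER_RESULT = 2, 5, 6


class Endpoint:
    """One side of the control channel; 'inbox' stands in for the RQ and CQ."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.peer = None

    def send(self, msg_type, data=b""):
        self.peer.inbox.put((msg_type, data))

    def wait(self, expected):
        # Blocks like waiting on the CQ event channel would.
        msg_type, data = self.inbox.get(timeout=5)
        if msg_type != expected:
            raise RuntimeError(f"expected type {expected}, got {msg_type}")
        return data


def exchange_recv(me, expected):
    me.send(READY)            # tell the peer we can accept a command
    return me.wait(expected)  # wait for it and verify the command type


def exchange_send(me, msg_type, data, response=None):
    me.wait(READY)            # wait for the peer's Ready credit
    me.send(msg_type, data)   # SEND the command
    if response is not None:
        return me.wait(response)  # e.g. the rkey in a Register result


def demo():
    sender, receiver = Endpoint(), Endpoint()
    sender.peer, receiver.peer = receiver, sender

    def receiver_side():
        chunk = exchange_recv(receiver, REGISTER_REQUEST)
        receiver.send(REGISTER_RESULT, b"rkey-for-" + chunk)

    t = threading.Thread(target=receiver_side)
    t.start()
    rkey = exchange_send(sender, REGISTER_REQUEST, b"chunk7", REGISTER_RESULT)
    t.join()
    return rkey
```

Each register request in this model costs a Ready, a request and a result
before the RDMA write for that chunk can start, which is the serialization
discussed earlier in the thread.
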
+librdmacm provides the user with a 'private data' area to be exchanged
+at connection-setup time before any infiniband traffic is generated.
+This is a convenient place to check for protocol versioning because the
+user does not need to register memory to transmit a few bytes of version
+This is also a convenient place to negotiate capabilities
+(like dynamic page registration).
This would be a good place to document the format of the
private data field.


+If the version is invalid, we throw an error.
Which version is valid in this specification?
Version 1. Will update.
+If the version is new, we only negotiate the capabilities that the
+requested version is able to perform and ignore the rest.
What are these capabilities and how do we negotiate them?
There is only one capability right now: dynamic server registration.

The client must tell the server whether the capability was
enabled on the primary VM side.

Will update the documentation.
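
One possible shape for that negotiation, as a sketch only: the private-data
format is not documented in this thread, so the bit value, the table and the
function name below are all hypothetical. Version 1 and the single
capability are the only facts taken from the discussion.

```python
CAP_CHUNK_REGISTER = 0x1  # hypothetical bit for dynamic server registration

# Capabilities understood by each protocol version (only version 1 exists).
KNOWN_CAPS = {1: CAP_CHUNK_REGISTER}


def negotiate(version, requested_caps):
    """Return the capabilities both sides will use, or raise on a bad version.

    Capabilities the given version does not know about are ignored rather
    than rejected, mirroring the "ignore the rest" rule in the quoted text.
    """
    if version not in KNOWN_CAPS:
        raise ValueError(f"unsupported protocol version {version}")
    return requested_caps & KNOWN_CAPS[version]
```

With this rule a client asking for an unknown bit still connects; it just
does not get that capability.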

+QEMUFileRDMA Interface:
+QEMUFileRDMA introduces a couple of new functions:
+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+These two functions are very short and simply use the protocol
+described above to deliver bytes without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+Finally, how do we hand off the actual bytes to get_buffer()?
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from control-channel's SEND
+messages in memory.
+Each time we receive a complete "QEMU File" control-channel
+message, the bytes from SEND are copied into a small local holding area.
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the holding area until get_buffer()
+comes around for another pass.
+If the buffer is empty, then we follow the same steps
+listed above and issue another "QEMU File" protocol command,
+asking for a new SEND message to re-fill the buffer.
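
The holding-area behaviour described above can be sketched as follows. The
class and the fetch_message callback are hypothetical stand-ins; in the patch
the refill would come from a "QEMU File" control-channel message.

```python
class HoldingArea:
    """Serves byte-stream reads out of whole control-channel messages."""

    def __init__(self, fetch_message):
        # fetch_message is a hypothetical callback returning the bytes of
        # the next complete "QEMU File" message.
        self._fetch = fetch_message
        self._buf = b""

    def get_buffer(self, size):
        """Return up to 'size' bytes, refilling only when the area is empty."""
        if not self._buf:
            self._buf = self._fetch()
        out, self._buf = self._buf[:size], self._buf[size:]
        return out
```

For example, with messages b"abcdef" and b"gh" queued, three successive
get_buffer(4) calls return b"abcd", b"ef" and b"gh". A read never spans two
messages in this sketch, so it may return fewer bytes than requested.
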
+Migration of pc.ram:
+At the beginning of the migration, (migration-rdma.c),
+the sender and the receiver populate the list of RAMBlocks
+to be registered with each other into a structure.
+Then, using the aforementioned protocol, they exchange a
+description of these blocks with each other, to be used later
+during the iteration of main memory. This description includes
+a list of all the RAMBlocks, their offsets and lengths and
+possibly includes pre-registered RDMA keys in case dynamic
+page registration was disabled on the server-side, otherwise not.
Worth mentioning here that memory hotplug will require a protocol
extension. That's also true of TCP so not a big deal ...


+Main memory is not migrated with the aforementioned protocol,
+but is instead migrated with normal RDMA Write operations.
+Pages are migrated in "chunks" (about 1 Megabyte right now).
Why "about"? This is not dynamic so needs to be exactly same
on both sides, right?
About is a typo =). It is hard-coded to exactly 1MB.

+Chunk size is not dynamic, but it could be in a future implementation.
+There's nothing to indicate that this is useful right now.
+When a chunk is full (or a flush() occurs), the memory backed by
+the chunk is registered with librdmacm and pinned in memory on
+both sides using the aforementioned protocol.
+After pinning, an RDMA Write is generated and transmitted
+for the entire chunk.
+Chunks are also transmitted in batches: This means that we
+do not request that the hardware signal the completion queue
+for the completion of *every* chunk. The current batch size
+is about 64 chunks (corresponding to 64 MB of memory).
+Only the last chunk in a batch must be signaled.
+This helps keep everything as asynchronous as possible
+and helps keep the hardware busy performing RDMA operations.
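
The batching rule can be illustrated with a small sketch. The batch size of
64 is from the text above; the function name is hypothetical, and signaling
the final chunk of a partial last batch is an assumption added so that the
tail of a transfer is also acknowledged.

```python
BATCH_SIZE = 64  # chunks per batch, as stated in the quoted text


def signaled_flags(num_chunks, batch_size=BATCH_SIZE):
    """For each chunk, say whether its RDMA write should request a completion.

    Only the last chunk of each batch is signaled. The very last chunk is
    signaled too, even if its batch is not full (an assumption).
    """
    return [
        (i + 1) % batch_size == 0 or i == num_chunks - 1
        for i in range(num_chunks)
    ]
```

For 130 chunks this yields three signaled writes (chunks 63, 127 and 129)
instead of 130 completions.
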
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode
+we use for RDMA migration.
+If a *single* message fails,
+the decision is to abort the migration entirely and
+clean up all the RDMA descriptors and unregister all
+the memory.
+After cleanup, the Virtual Machine is returned to normal
+operation the same way that would happen if the TCP
+socket is broken during a non-RDMA based migration.
That's on sender side? Presumably this means you respond to
completion with error?
  How does receive side know
migration is complete?

Yes, on the sender side.

Migration "completeness" logic has not changed in this patch series.

Please recall that the entire QEMUFile protocol is still
happening at the upper-level inside of savevm.c/arch_init.c.

+1. Currently, cgroups swap limits for *both* TCP and RDMA
+   on the sender-side are broken. This is more acute for
+   RDMA because RDMA requires memory registration.
+   Fixing this requires infiniband page registrations to be
+   zero-page aware, and this does not yet work properly.
+2. Currently overcommit for the *receiver* side of
+   TCP works, but not for RDMA. While dynamic page registration
+   *does* work, it is only useful if the is_zero_page() capability
+   remains enabled (which it is by default).
+   However, leaving this capability turned on *significantly* slows
+   down the RDMA throughput, particularly on hardware capable
+   of transmitting faster than 10 gbps (such as 40gbps links).
+3. Use of the recent /proc/<pid>/pagemap would likely solve some
+   of these problems.
+4. Some form of balloon-device usage tracking would also
+   help alleviate some of these issues.
+Using a 40gbps infiniband link performing a worst-case stress test:
+RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+Approximately 30 gbps (a little better than the paper)
+1. Average worst-case throughput
+TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+2. Approximately 8 gbps (using IPoIB, IP over Infiniband)
+Average downtime (stop time) ranges between 28 and 33 milliseconds.
+An *exhaustive* paper (2010) shows additional performance details
+linked on the QEMU wiki:
