qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PULL 5/8] rdma: core rdma logic


From: Michael R. Hines
Subject: Re: [Qemu-devel] [PULL 5/8] rdma: core rdma logic
Date: Tue, 16 Apr 2013 23:27:02 -0400
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130106 Thunderbird/17.0.2

On 04/16/2013 12:49 AM, Paolo Bonzini wrote:
+#define RDMA_CHUNK_REGISTRATION
+#define RDMA_LAZY_CLIENT_REGISTRATION
Please drop these; no dead code.

These are important sender-side-only debugging flags.
I'll add an explicit "debug-only" comment.

+/* Do not merge data if larger than this. */
+#define RDMA_MERGE_MAX (4 * 1024 * 1024)
+#define RDMA_UNSIGNALED_SEND_MAX 64
+
+#define RDMA_REG_CHUNK_SHIFT 20 /* 1 MB */
+//#define RDMA_REG_CHUNK_SHIFT 21 /* 2 MB */
+//#define RDMA_REG_CHUNK_SHIFT 22 /* 4 MB */
+//#define RDMA_REG_CHUNK_SHIFT 23 /* 8 MB */
+//#define RDMA_REG_CHUNK_SHIFT 24 /* 16 MB */
+//#define RDMA_REG_CHUNK_SHIFT 25 /* 32 MB */
+//#define RDMA_REG_CHUNK_SHIFT 26 /* 64 MB */
+//#define RDMA_REG_CHUNK_SHIFT 27 /* 128 MB */
+//#define RDMA_REG_CHUNK_SHIFT 28 /* 256 MB */
IIUC this must be agreed upon by source and destination.  Either make it
part of the protocol (e.g. using the extra available bits in the
RAM_SAVE_HOOK 64-bit value), or drop the alternatives.

These are also important debugging flags.
I'll add an explicit "debug-only" comment.

(Chunk size is not optional in the protocol right now, actually,
as the data shows that it doesn't really improve performance much).

+#define RDMA_REG_CHUNK_START(rdma_ram_block, i) ((uint8_t *)\
+            ((((unsigned long)((rdma_ram_block)->local_host_addr) >> \
+                RDMA_REG_CHUNK_SHIFT) + (i)) << \
+                RDMA_REG_CHUNK_SHIFT))
This should be:

#define RDMA_REG_CHUNK_START(rdma_ram_block, i) ((uint8_t *)\
             (((uintptr_t)(rdma_ram_block)->local_host_addr) \
              + ((i) << RDMA_REG_CHUNK_SHIFT))

Otherwise, the length of the first chunk might be different on the two
sides.  You're probably not seeing because memory is aligned to 2MB by
default, but with bigger chunk sizes you might get it.

Ah, nice catch. Double-shifting is never good =)

+#define RDMA_BLOCKING
What's the difference?

Definitely dead code - will remove.

+/*
+ * Completion queue can be filled by both read and write work requests,
+ * so must reflect the sum of both possible queue sizes.
+ */
+#define RDMA_QP_SIZE 1000
+#define RDMA_CQ_SIZE (RDMA_QP_SIZE * 3)
+
+/*
+ * Maximum size infiniband SEND message
+ */
+#define RDMA_CONTROL_MAX_BUFFER (512 * 1024)
+#define RDMA_CONTROL_MAX_WR 2
+#define RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE 4096
+
+/*
+ * Capabilities for negotiation.
+ */
+#define RDMA_CAPABILITY_CHUNK_REGISTER 0x01
+#define RDMA_CAPABILITY_NEXT_FEATURE   0x02
Please drop RDMA_CAPABILITY_NEXT_FEATURE and fail migration if unknown
capabilities are transmitted.

"NEXT" not used in the protocol, it's just there to remind the author what
lines of code need to change in the years to come whenever the protocol
adds a new feature.

I'll add a comment to make it clear.

Failure already happens for unknown capabilities.


+
+    /* resolve the first address */
+    ret = rdma_resolve_addr(rdma->cm_id, NULL, res->ai_addr,
+            RDMA_RESOLVE_TIMEOUT_MS);
+    if (ret) {
+        fprintf(stderr, "could not resolve address %s\n", rdma->host);
+        goto err_resolve_get_addr;
+    }
+
+    qemu_rdma_dump_gid("client_resolve_addr", rdma->cm_id);
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        fprintf(stderr, "could not perform event_addr_resolved\n");
+        goto err_resolve_get_addr;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
+        fprintf(stderr, "result not equal to event_addr_resolved %s\n",
+                rdma_event_str(cm_event->event));
+        perror("rdma_resolve_addr");
+        rdma_ack_cm_event(cm_event);
+        goto err_resolve_get_addr;
+    }
+    rdma_ack_cm_event(cm_event);
Move the rdma_ack_cm_event before the if, so that you can have it just once.

Can't do this =) The ack() will internally free the memory.

+
+                remote->block[i].remote_rkey = local->block[i].mr->rkey;
+            }
+
+            remote->block[i].offset = local->block[i].offset;
+            remote->block[i].length = local->block[i].length;
+    }
+}
+
+/*
+ * Client then propogates the remote ram block descriptions to his local copy.
+ * Really, only the virtual addresses are useful, but we propogate everything
+ * anyway.
+ *
+ * If we're using dynamic registration on the server side (the default), then
+ * the 'rkeys' are not useful because we will re-ask for them later during
+ * runtime.
+ */
What if we are not using it?  Should the comment be more like:

/*
  * Client then propagates the remote ram block descriptions to his
  * local copy.  If we're using dynamic registration on the server side
  * (the default), only the virtual addresses are actually useful
  * because we will re-ask for the 'rkeys' later.  But for simplicity
  * we propagate everything anyway.
  */

Not quite right, but I do understand the confusion:

/*
 * The protocol uses two different sets of rkeys (mutually exclusive):
 * 1. One key to represent the virtual address of the entire ram block.
 *    (dynamic chunk registration disabled - pin everything with one key.)
 * 2. One to represent individual chunks within a ram block.
 *    (dynamic chunk registration enabled - pin individual chunks.)
 *
* Once the capability is successfully negotiated, the destination transmits
 * the keys to use (or sends them later) including the virtual addresses
* and then propagates the remote ram block descriptions to his local copy.
 */


?

This "if" is needed, but boy it is ugly! :)

I think it's better if you change RDMA_REG_CHUNK_* to static inline
functions.  This way, the clamping of RDMA_REG_CHUNK_END's return value
can be placed in the function itself.

=)


+        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
+                start_addr,
+                end_addr - start_addr,
+                (rkey ? (IBV_ACCESS_LOCAL_WRITE |
+                            IBV_ACCESS_REMOTE_WRITE) : 0)
+                | IBV_ACCESS_REMOTE_READ);
Here my lack of IB-fu is showing, but...  I would have thought of
something like this

      (rkey ? IBV_ACCESS_REMOTE_WRITE : 0) |
      (lkey ? IBV_ACCESS_LOCAL_READ : 0)

i.e. on the source IB will read, on the destination IB will write on
behalf of the source?

Why are the flags like that?

=) It confused me too when I first started writing IB code:

LOCAL_WRITE is just a common practice - grant your
own code access on the same adapter.

LOCAL_READ doesn't exist =)

I also added REMOTE_READ because I didn't want to exclude
future protocols from introspecting the peer in case we wanted
to do something "fancy" in the future. (I don't know what fancy
would be, really, but I can certainly take it out if folks don't like it.)



+            if (!can_use_buffer_find_nonzero_offset((void *)sge.addr, length)) 
{
+                fprintf(stderr, "Chunk cannot be checked for zero!!!! "
+                    "%d for %d bytes, index: %d, offset: %" PRId64 "...\n",
+                    chunk, sge.length, current_index, offset);
+                return -EIO;
+            }
This shouldn't fail.  If it failed, it means that sge.addr is not
suitably aligned.  But if it fails, just proceed with registration.

That is, add

     can_use_buffer_find_nonzero_offset((void *)sge.addr, length) &&

to the "if" below.

Good catch.


+/*
+ * Push out any unwritten RDMA operations.
+ *
+ * We support sending out multiple chunks at the same time.
+ * Not all of them need to get signaled in the completion queue.
+ */
+static int qemu_rdma_write_flush(QEMUFile *f, RDMAContext *rdma)
+{
+    int ret;
+    enum ibv_send_flags flags = 0;
+
+    if (!rdma->current_length) {
+        return 0;
+    }
+    if (rdma->num_unsignaled_send >=
+            RDMA_UNSIGNALED_SEND_MAX) {
+        flags = IBV_SEND_SIGNALED;
+    }
+
+    while (1) {
+        ret = qemu_rdma_write_one(f, rdma,
+                rdma->current_index,
+                rdma->current_offset,
+                rdma->current_length,
+                RDMA_WRID_RDMA_WRITE, flags);
+        if (ret < 0) {
+            if (ret == ENOMEM) {
This should be -ENOMEM at the very least, but I'm not sure who returns
it.  Is it from IB?

Yes, it's directly from IB. (Thanks for catching the missing negative).



+                DPRINTF("send queue is full. wait a little....\n");
+                ret = wait_for_wrid(rdma, RDMA_WRID_RDMA_WRITE);
+                if (ret < 0) {
+                    fprintf(stderr, "rdma migration: failed to make "
+                                    "room in full send queue! %d\n", ret);
+                    return -EIO;
+                }
+            } else {
+                 fprintf(stderr, "rdma migration: write flush error! %d\n",
+                                                                ret);
+                 perror("write flush error");
+                 return -EIO;
+            }
+        } else {
+                break;
+        }
+    }
Goto works nicely here:

retry:
      ret = qemu_rdma_write_one(f, rdma,
               rdma->current_index,
               rdma->current_offset,
               rdma->current_length,
               RDMA_WRID_RDMA_WRITE, flags);
      if (ret < 0) {
          if (ret == -ENOMEM) {
              DPRINTF("send queue is full. wait a little....\n");
              ret = wait_for_wrid(rdma, RDMA_WRID_RDMA_WRITE);
              if (ret >= 0) {
                  goto retry;
              }
              if (ret < 0) {
                  fprintf(stderr, "rdma migration: failed to make "
                                  "room in full send queue! %d\n", ret);
                  return ret;
              }
          }
           perror("rdma migration: write flush error");
           return ret;
      }


cool. =)

+    if (remote_ram_blocks.remote_area) {
+        g_free(remote_ram_blocks.remote_area);
No "if" around g_free.

Does g_free already check for NULL?

+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        perror("rdma_get_cm_event after rdma_connect");
+        fprintf(stderr, "rdma migration: error connecting!");
+        rdma_ack_cm_event(cm_event);
I think this rdma_ack_cm_event is wrong, rdma_get_cm_event has failed.

It's correct. The library returns to you the reason for the failure,
but you still must call "ack" to free the memory that was allocated
by the library.

+        rdma_destroy_id(rdma->cm_id);
+        rdma->cm_id = 0;
+        goto err_rdma_client_connect;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
+        perror("rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect");
+        fprintf(stderr, "rdma migration: error connecting!");
+        rdma_ack_cm_event(cm_event);
+        rdma_destroy_id(rdma->cm_id);
+        rdma->cm_id = 0;
+        goto err_rdma_client_connect;
+    }
+
+    memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
+    network_to_caps(&cap);
+
+    /*
+     * Verify that the destination can support the capabilities we requested.
+     */
+    if (!(cap.flags & RDMA_CAPABILITY_CHUNK_REGISTER) &&
+        rdma->chunk_register_destination) {
+        printf("Server cannot support dynamic registration. Will disable\n");
Is it really "cannot support" or rather "has disabled"?  If so, this is
not needed, the printf below already prints the same info.  Just make it
like this:

    rdma->chunk_register_destination =
        !!(cap.flags & RDMA_CAPABILITY_CHUNK_REGISTER);


We can't only check the flag, we can only enable or disable the flag
if the user first requested the capability. Only if the user requested
the capability on the source do we validate that request with the
destination.

And only then if both the user and the destination agree to we finally
enable the capability.


+        rdma->chunk_register_destination = false;
+    }
+
+    printf("Chunk registration %s\n",
+        rdma->chunk_register_destination ? "enabled" : "disabled");
+
+    rdma_ack_cm_event(cm_event);
Move this above, before the "if (cm_event->event !=
RDMA_CM_EVENT_ESTABLISHED)".

You cannot do that because the "private_data" pointer is stored inside
the cm_event memory which will get freed too early when the ack()
is called.

+static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
+                   ram_addr_t block_offset, ram_addr_t offset, size_t size)
+{
+
+    qemu_ftell(f);
If this is a trick to call qemu_fflush, please export the function
instead, or alternatively call it from ram_control_save_page before
f->ops->save_page (and after testing that f->ops->save_page exists).

Yes, it was a trick to call flush =)

I would prefer to export the function, because we don't want to
flush the buffer too often, but only when the RDMA protocol requires it.

+            goto err_rdma_server_wait;
+    }
+
+    if (cap.version == RDMA_CONTROL_VERSION_1) {
+        if (cap.flags & RDMA_CAPABILITY_CHUNK_REGISTER) {
+            rdma->chunk_register_destination = true;
+        } else if (cap.flags & RDMA_CAPABILITY_NEXT_FEATURE) {
+            /* handle new capability */
+        }
As mentioned above, please drop this "else if".  But in general, this
"if" is useless.  Please replace it with an

     /* We only support one version, and we rejected all
      * others above.
      */
     assert(cap.version == RDMA_CONTROL_VERSION_CURRENT);


We don't want to kill QEMU with an assertion, do we?
Shouldn't we throw the error back to the user?


+void rdma_start_incoming_migration(const char *host_port, Error **errp)
+{
+    int ret;
+    RDMAContext *rdma;
+
+    DPRINTF("Starting RDMA-based incoming migration\n");
+    rdma = qemu_rdma_data_init(host_port, errp);
+    if (rdma == NULL) {
+        return;
+    }
+
+    ret = qemu_rdma_server_init(rdma, NULL);
+
+    if (!ret) {
+        DPRINTF("qemu_rdma_server_init success\n");
+        ret = qemu_rdma_server_prepare(rdma, NULL);
+
+        if (!ret) {
+            DPRINTF("qemu_rdma_server_prepare success\n");
+
+            qemu_set_fd_handler2(rdma->channel->fd, NULL,
I must have asked before, is it possible to use rdma->channel->fd or
something similar also in qemu_rdma_block_for_wrid, before calling
qemu_rdma_poll?

Yes, in my next patch, you will see a new version which uses
that file descriptor by calling <poll.h>'s poll().

Also, in the next patch, you will see much better error
handling of ibv_reg_mr() function as we agreed when the user sets "ulimit -l".

- Michael




reply via email to

[Prev in Thread] Current Thread [Next in Thread]