qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PULL 5/8] rdma: core rdma logic


From: Paolo Bonzini
Subject: Re: [Qemu-devel] [PULL 5/8] rdma: core rdma logic
Date: Wed, 17 Apr 2013 10:58:04 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130311 Thunderbird/17.0.4

Il 17/04/2013 05:27, Michael R. Hines ha scritto:
> On 04/16/2013 12:49 AM, Paolo Bonzini wrote:
>> +#define RDMA_CHUNK_REGISTRATION
>> +#define RDMA_LAZY_CLIENT_REGISTRATION
>> Please drop these; no dead code.
> 
> These are important sender-side-only debugging flags.
> I'll add an explicit "debug-only" comment.

No, I really mean it.  No dead code.  We included the one useful option
as a capability, and that should be it.

>>> +/* Do not merge data if larger than this. */
>>> +#define RDMA_MERGE_MAX (4 * 1024 * 1024)
>>> +#define RDMA_UNSIGNALED_SEND_MAX 64
>>> +
>>> +#define RDMA_REG_CHUNK_SHIFT 20 /* 1 MB */
>>> +//#define RDMA_REG_CHUNK_SHIFT 21 /* 2 MB */
>>> +//#define RDMA_REG_CHUNK_SHIFT 22 /* 4 MB */
>>> +//#define RDMA_REG_CHUNK_SHIFT 23 /* 8 MB */
>>> +//#define RDMA_REG_CHUNK_SHIFT 24 /* 16 MB */
>>> +//#define RDMA_REG_CHUNK_SHIFT 25 /* 32 MB */
>>> +//#define RDMA_REG_CHUNK_SHIFT 26 /* 64 MB */
>>> +//#define RDMA_REG_CHUNK_SHIFT 27 /* 128 MB */
>>> +//#define RDMA_REG_CHUNK_SHIFT 28 /* 256 MB */
>> IIUC this must be agreed upon by source and destination.  Either make it
>> part of the protocol (e.g. using the extra available bits in the
>> RAM_SAVE_HOOK 64-bit value), or drop the alternatives.
> 
> These are also important debugging flags.
> I'll add an explicit "debug-only" comment.

Please drop the commented lines.

>>> +#define RDMA_BLOCKING
>> What's the difference?
> 
> Definitely dead code - will remove.

Ok. :)
>>> +/*
>>> + * Capabilities for negotiation.
>>> + */

>>> +#define RDMA_CAPABILITY_CHUNK_REGISTER 0x01
>>> +#define RDMA_CAPABILITY_NEXT_FEATURE   0x02
>> Please drop RDMA_CAPABILITY_NEXT_FEATURE and fail migration if unknown
>> capabilities are transmitted.
> 
> "NEXT" not used in the protocol, it's just there to remind the author what
> lines of code need to change in the years to come whenever the protocol
> adds a new feature.
> 
> I'll add a comment to make it clear.

There is already a comment, and people can just search for
RDMA_CAPABILITY_CHUNK_REGISTER.  We do that all the time.
Please listen to review comments.

> Failure already happens for unknown capabilities.

Maybe I misread the code, but all I saw is:

+        if (cap.flags & RDMA_CAPABILITY_CHUNK_REGISTER) {
+            rdma->chunk_register_destination = true;
+        } else if (cap.flags & RDMA_CAPABILITY_NEXT_FEATURE) {
+            /* handle new capability */
+        }

and no failure.  I may well be wrong of course.

>> Move the rdma_ack_cm_event before the if, so that you can have it just
>> once.
> 
> Can't do this =) The ack() will internally free the memory.

Oh, ok.  Sorry for the noise.

> Not quite right, but I do understand the confusion:
> 
> /*
>  * The protocol uses two different sets of rkeys (mutually exclusive):
>  * 1. One key to represent the virtual address of the entire ram block.
>  *    (dynamic chunk registration disabled - pin everything with one key.)
>  * 2. One to represent individual chunks within a ram block.
>  *    (dynamic chunk registration enabled - pin individual chunks.)
>  *
>  * Once the capability is successfully negotiated, the destination
> transmits
>  * the keys to use (or sends them later) including the virtual addresses
>  * and then propagates the remote ram block descriptions to his local copy.
>  */

Good.

>>> This "if" is needed, but boy it is ugly! :)
>>>
>>> I think it's better if you change RDMA_REG_CHUNK_* to static inline
>>> functions.  This way, the clamping of RDMA_REG_CHUNK_END's return value
>>> can be placed in the function itself.
> 
> =)

I take it as a "yes". :)

>>
>>> +        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
>>> +                start_addr,
>>> +                end_addr - start_addr,
>>> +                (rkey ? (IBV_ACCESS_LOCAL_WRITE |
>>> +                            IBV_ACCESS_REMOTE_WRITE) : 0)
>>> +                | IBV_ACCESS_REMOTE_READ);
>> Here my lack of IB-fu is showing, but...  I would have thought of
>> something like this
>>
>>       (rkey ? IBV_ACCESS_REMOTE_WRITE : 0) |
>>       (lkey ? IBV_ACCESS_LOCAL_READ : 0)
>>
>> i.e. on the source IB will read, on the destination IB will write on
>> behalf of the source?
>>
>> Why are the flags like that?
> 
> =) It confused me too when I first started writing IB code:
> 
> LOCAL_WRITE is just a common practice - grant your
> own code access on the same adapter.
> 
> LOCAL_READ doesn't exist =)
> 
> I also added REMOTE_READ because I didn't want to exclude
> future protocols from introspecting the peer in case we wanted
> to do something "fancy" in the future. (I don't know what fancy
> would be, really, but I can certainly take it out if folks don't like it.)

Hmm, please leave it out.

>>> +    if (remote_ram_blocks.remote_area) {
>>> +        g_free(remote_ram_blocks.remote_area);
>> No "if" around g_free.
> 
> Does g_free already check for NULL?

Yes.

>> +    ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> +    if (ret) {
>> +        perror("rdma_get_cm_event after rdma_connect");
>> +        fprintf(stderr, "rdma migration: error connecting!");
>> +        rdma_ack_cm_event(cm_event);
>> I think this rdma_ack_cm_event is wrong, rdma_get_cm_event has failed.
> 
> It's correct. The library returns to you the reason for the failure,
> but you still must call "ack" to free the memory that was allocated
> by the library.

Please double check, IIRC in some cases you are missing it, and thanks
for teaching me. :)

>> Is it really "cannot support" or rather "has disabled"?  If so, this is
>> not needed, the printf below already prints the same info.  Just make it
>> like this:
>>
>>     rdma->chunk_register_destination =
>>         !!(cap.flags & RDMA_CAPABILITY_CHUNK_REGISTER);
>>
> 
> We can't only check the flag, we can only enable or disable the flag
> if the user first requested the capability. Only if the user requested
> the capability on the source do we validate that request with the
> destination.
> 
> And only then if both the user and the destination agree to we finally
> enable the capability.

Just make sure the value of the capability on the destination is
disregarded.  All that matters is the value on the source (which is sent
on the wire).

> 
>>> +        rdma->chunk_register_destination = false;
>>> +    }
>>> +
>>> +    printf("Chunk registration %s\n",
>>> +        rdma->chunk_register_destination ? "enabled" : "disabled");
>>> +
>>> +    rdma_ack_cm_event(cm_event);
>> Move this above, before the "if (cm_event->event !=
>> RDMA_CM_EVENT_ESTABLISHED)".
> 
> You cannot do that because the "private_data" pointer is stored inside
> the cm_event memory which will get freed too early when the ack()
> is called.

Yep, understood now.

>>> +static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
>>> +                   ram_addr_t block_offset, ram_addr_t offset,
>>> size_t size)
>>> +{
>>> +
>>> +    qemu_ftell(f);
>> If this is a trick to call qemu_fflush, please export the function
>> instead, or alternatively call it from ram_control_save_page before
>> f->ops->save_page (and after testing that f->ops->save_page exists).
> 
> Yes, it was a trick to call flush =)
> 
> I would prefer to export the function, because we don't want to
> flush the buffer too often, but only when the RDMA protocol requires it.

Ok, then export it please.

>> +            goto err_rdma_server_wait;
>> +    }
>> +
>> +    if (cap.version == RDMA_CONTROL_VERSION_1) {
>> +        if (cap.flags & RDMA_CAPABILITY_CHUNK_REGISTER) {
>> +            rdma->chunk_register_destination = true;
>> +        } else if (cap.flags & RDMA_CAPABILITY_NEXT_FEATURE) {
>> +            /* handle new capability */
>> +        }
>> As mentioned above, please drop this "else if".  But in general, this
>> "if" is useless.  Please replace it with an
>>
>>      /* We only support one version, and we rejected all
>>       * others above.
>>       */
>>      assert(cap.version == RDMA_CONTROL_VERSION_CURRENT);
> 
> We don't want to kill QEMU with an assertion, do we?
> Shouldn't we throw the error back to the user?

You already filtered versions < min and > current with a decent error,
so the only reamining version at this point is 1.  You know cap.version
== RDMA_CONTROL_VERSION_CURRENT ("we rejected all others above").

> Yes, in my next patch, you will see a new version which uses
> that file descriptor by calling <poll.h>'s poll().

Ok.  Then we can improve it to support coroutines, too, so that the
monitor remains responsive on the incoming side.  The idea is that if
poll says the file descriptor is not available you do
yield_until_fd_readable(fd) where yield_until_fd_readable is currently
static in qemu-file.c but could be made public and moved to
qemu-coroutine-io.c and include/qemu-common.h:

    typedef struct {
        Coroutine *co;
        int fd;
    } FDYieldUntilData;

    static void fd_coroutine_enter(void *opaque)
    {
        FDYieldUntilData *data = opaque;
        qemu_set_fd_handler(data->fd, NULL, NULL, NULL);
        qemu_coroutine_enter(data->co, NULL);
    }

    /**
     * Yield until a file descriptor becomes readable
     *
     * Note that this function clobbers the handlers for the file
descriptor.
     */
    static void coroutine_fn yield_until_fd_readable(int fd)
    {
        FDYieldUntilData data;

        assert(qemu_in_coroutine());
        data.co = qemu_coroutine_self();
        data.fd = fd;
        qemu_set_fd_handler(fd, fd_coroutine_enter, NULL, &data);
        qemu_coroutine_yield();
    }

Please feel free to include such a patch (separate from this one) in
your next submission.

Paolo




reply via email to

[Prev in Thread] Current Thread [Next in Thread]