Re: [PATCH v2 1/2] migration/rdma: Fix out of order wrid


From: address@hidden
Subject: Re: [PATCH v2 1/2] migration/rdma: Fix out of order wrid
Date: Mon, 28 Jun 2021 07:13:37 +0000


On 25/06/2021 00:42, Dr. David Alan Gilbert wrote:
> * Li Zhijian (lizhijian@cn.fujitsu.com) wrote:
>> destination:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
>> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
>> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
>> if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 
>> -device 
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
>> -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga 
>> qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming 
>> rdma:192.168.22.23:8888
>> qemu-system-x86_64: -spice 
>> streaming-video=filter,port=5902,disable-ticketing: warning: short-form 
>> boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name 
>> uverbs2, infiniband_verbs class device path 
>> /sys/class/infiniband_verbs/uverbs2, infiniband class device path 
>> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got 
>> CONTROL RECV (4000)
>>
>> source:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
>> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
>> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
>> if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device 
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
>> -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga 
>> qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
>> qemu-system-x86_64: -spice 
>> streaming-video=filter,port=5901,disable-ticketing: warning: short-form 
>> boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu)
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) migrate -d rdma:192.168.22.23:8888
>> source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device 
>> name uverbs2, infiniband_verbs class device path 
>> /sys/class/infiniband_verbs/uverbs2, infiniband class device path 
>> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got 
>> CONTROL RECV (4000)
>>
>> NOTE: soft RoCE is used as the rdma device.
>> [root@iaas-rpma images]# rdma link show rxe_eth0/1
>> link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
>>
>> This migration cannot complete once an out-of-order (OOO) CQ event occurs.
>> OOO cases can occur on both the source side and the destination side, and
>> only an OOO between SEND and RECV matters; OOO between 'WRITE RDMA' and
>> 'RECV' doesn't matter.
>>
>> Below is the OOO sequence:
>>        source                     destination
>>    qemu_rdma_write_one()          qemu_rdma_registration_handle()
>> 1.   post_recv X                 post_recv Y
>> 2.                               post_send X
>> 3.                               wait X CQ event
>> 4.   X CQ event
>> 5.   post_send Y
>> 6.   wait Y CQ event
>> 7.                               Y CQ event (dropped)
>> 8.   Y CQ event(send Y done)
>> 9.                               X CQ event(send X done)
>> 10.                              wait Y CQ event (dropped at (7), blocks forever)
>>
>> In about a hundred runs, it looks like this only happens with a soft RoCE
>> rdma device; a hardware IB device works fine.
>>
>> Here we introduce an independent send completion queue to separate the
>> ibv_post_send completions from the original mixed completion queue. This
>> lets us poll for exactly the CQE we are interested in.
> Hi Li,
>    OK, it's been a while since I've thought this much about completion, but
> I think that's OK; however, what stops the other messages, RDMA_WRITE and
> SEND_CONTROL, from being out of order?

Once either the source or the destination hits the OOO wrid below, both sides
end up waiting for their FDs to become readable, so the migration has no
chance to complete:

qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL RECV (4000)
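
For reference, here is the core of the fix in standalone form: a minimal
sketch using standard libibverbs/librdmacm calls. Error handling is elided,
and the CQ depth of 64 is an assumption for illustration, not the value the
patch uses.

/*
 * Minimal sketch of what the patch below does, reduced to the verbs
 * calls. Giving the QP a dedicated send CQ means a send completion can
 * never be consumed -- and dropped -- while the code is draining
 * receive completions, and vice versa.
 */
#include <infiniband/verbs.h>
#include <rdma/rdma_cma.h>

static int create_split_cqs(struct rdma_cm_id *id, struct ibv_pd *pd)
{
    struct ibv_comp_channel *recv_ch = ibv_create_comp_channel(id->verbs);
    struct ibv_comp_channel *send_ch = ibv_create_comp_channel(id->verbs);
    /* one CQ per direction, each bound to its own completion channel */
    struct ibv_cq *recv_cq = ibv_create_cq(id->verbs, 64, NULL, recv_ch, 0);
    struct ibv_cq *send_cq = ibv_create_cq(id->verbs, 64, NULL, send_ch, 0);
    struct ibv_qp_init_attr attr = {
        .send_cq = send_cq,   /* send work requests complete here */
        .recv_cq = recv_cq,   /* receive work requests complete here */
        .cap = {
            .max_send_wr = 64, .max_recv_wr = 3,
            .max_send_sge = 1, .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,
    };

    if (!recv_ch || !send_ch || !recv_cq || !send_cq) {
        return -1;
    }
    return rdma_create_qp(id, pd, &attr);
}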



>
>    Could this be fixed another way: make block_for_wrid record a flag for
> WRIDs it has received, and then check (and clear) that flag right at the
> start?

I intended to do something like that in [1], but I think it's too tricky and hard to understand.

And I have concerns about:
- should we record an OOO between 'WRITE RDMA' and CONTROL RECV even though it
doesn't matter in practice?
- how many ooo_wrid entries should we record? I have observed 2 later WRs' CQ
events arriving earlier than the wanted one (see the sketch after [1] below).



[1]: 
https://lore.kernel.org/qemu-devel/162371118578.2358.12447251487494492434@7c66fb7bc3ab/T/#t
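
To make the second concern concrete, here is a rough sketch of the
record-and-check idea. The ooo_wrid array, its size, and the helper names are
hypothetical, made up for illustration; this is not what [1] actually
implements.

/*
 * Hypothetical sketch of the record-and-check alternative: completions
 * that arrive while we are waiting for a different wrid get remembered,
 * and each call to block_for_wrid would consult this cache first. The
 * open question above is how large OOO_WRID_MAX must be.
 */
#include <stdint.h>
#include <stdbool.h>

#define OOO_WRID_MAX 4          /* assumption: >= 2, per the observation above */

static uint64_t ooo_wrid[OOO_WRID_MAX];
static int ooo_nr;

/* remember a completion we were not waiting for */
static void cache_ooo_wrid(uint64_t wrid)
{
    if (ooo_nr < OOO_WRID_MAX) {
        ooo_wrid[ooo_nr++] = wrid;
    }
}

/* check (and clear) the flag right at the start of block_for_wrid */
static bool take_cached_wrid(uint64_t wanted)
{
    for (int i = 0; i < ooo_nr; i++) {
        if (ooo_wrid[i] == wanted) {
            ooo_wrid[i] = ooo_wrid[--ooo_nr];   /* swap-remove the entry */
            return true;
        }
    }
    return false;
}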

Thanks
Li

>
> Dave
>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>> V2 Introduce send completion queue
>> ---
>>   migration/rdma.c | 94 ++++++++++++++++++++++++++++++++++++++++--------
>>   1 file changed, 79 insertions(+), 15 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index d90b29a4b51..16fe0688858 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -359,8 +359,10 @@ typedef struct RDMAContext {
>>       struct rdma_event_channel   *channel;
>>       struct ibv_qp *qp;                      /* queue pair */
>>       struct ibv_comp_channel *comp_channel;  /* completion channel */
>> +    struct ibv_comp_channel *send_comp_channel;  /* send completion channel 
>> */
>>       struct ibv_pd *pd;                      /* protection domain */
>>       struct ibv_cq *cq;                      /* completion queue */
>> +    struct ibv_cq *send_cq;                 /* send completion queue */
>>   
>>       /*
>>        * If a previous write failed (perhaps because of a failed
>> @@ -1067,8 +1069,7 @@ static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
>>       }
>>   
>>       /*
>> -     * Completion queue can be filled by both read and write work requests,
>> -     * so must reflect the sum of both possible queue sizes.
>> +     * Completion queue can be filled by read work requests.
>>        */
>>       rdma->cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
>>               NULL, rdma->comp_channel, 0);
>> @@ -1077,6 +1078,20 @@ static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
>>           goto err_alloc_pd_cq;
>>       }
>>   
>> +    /* create send completion channel */
>> +    rdma->send_comp_channel = ibv_create_comp_channel(rdma->verbs);
>> +    if (!rdma->send_comp_channel) {
>> +        error_report("failed to allocate completion channel");
>> +        goto err_alloc_pd_cq;
>> +    }
>> +
>> +    rdma->send_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
>> +                                  NULL, rdma->send_comp_channel, 0);
>> +    if (!rdma->send_cq) {
>> +        error_report("failed to allocate completion queue");
>> +        goto err_alloc_pd_cq;
>> +    }
>> +
>>       return 0;
>>   
>>   err_alloc_pd_cq:
>> @@ -1086,8 +1101,16 @@ err_alloc_pd_cq:
>>       if (rdma->comp_channel) {
>>           ibv_destroy_comp_channel(rdma->comp_channel);
>>       }
>> +    if (rdma->send_comp_channel) {
>> +        ibv_destroy_comp_channel(rdma->send_comp_channel);
>> +    }
>> +    if (rdma->cq) {
>> +        ibv_destroy_cq(rdma->cq);
>> +        rdma->cq = NULL;
>> +    }
>>       rdma->pd = NULL;
>>       rdma->comp_channel = NULL;
>> +    rdma->send_comp_channel = NULL;
>>       return -1;
>>   
>>   }
>> @@ -1104,7 +1127,7 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
>>       attr.cap.max_recv_wr = 3;
>>       attr.cap.max_send_sge = 1;
>>       attr.cap.max_recv_sge = 1;
>> -    attr.send_cq = rdma->cq;
>> +    attr.send_cq = rdma->send_cq;
>>       attr.recv_cq = rdma->cq;
>>       attr.qp_type = IBV_QPT_RC;
>>   
>> @@ -1420,14 +1443,14 @@ static void qemu_rdma_signal_unregister(RDMAContext 
>> *rdma, uint64_t index,
>>    * (of any kind) has completed.
>>    * Return the work request ID that completed.
>>    */
>> -static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
>> -                               uint32_t *byte_len)
>> +static uint64_t qemu_rdma_poll(RDMAContext *rdma, struct ibv_cq *cq,
>> +                              uint64_t *wr_id_out, uint32_t *byte_len)
>>   {
>>       int ret;
>>       struct ibv_wc wc;
>>       uint64_t wr_id;
>>   
>> -    ret = ibv_poll_cq(rdma->cq, 1, &wc);
>> +    ret = ibv_poll_cq(cq, 1, &wc);
>>   
>>       if (!ret) {
>>           *wr_id_out = RDMA_WRID_NONE;
>> @@ -1499,7 +1522,8 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, 
>> uint64_t *wr_id_out,
>>   /* Wait for activity on the completion channel.
>>    * Returns 0 on success, none-0 on error.
>>    */
>> -static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>> +static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
>> +                                       struct ibv_comp_channel *ch)
>>   {
>>       struct rdma_cm_event *cm_event;
>>       int ret = -1;
>> @@ -1510,7 +1534,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext 
>> *rdma)
>>        */
>>       if (rdma->migration_started_on_destination &&
>>           migration_incoming_get_current()->state == 
>> MIGRATION_STATUS_ACTIVE) {
>> -        yield_until_fd_readable(rdma->comp_channel->fd);
>> +        yield_until_fd_readable(ch->fd);
>>       } else {
>>           /* This is the source side, we're in a separate thread
>>            * or destination prior to migration_fd_process_incoming()
>> @@ -1521,7 +1545,7 @@ static int qemu_rdma_wait_comp_channel(RDMAContext 
>> *rdma)
>>            */
>>           while (!rdma->error_state  && !rdma->received_error) {
>>               GPollFD pfds[2];
>> -            pfds[0].fd = rdma->comp_channel->fd;
>> +            pfds[0].fd = ch->fd;
>>               pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>               pfds[0].revents = 0;
>>   
>> @@ -1579,6 +1603,17 @@ static int qemu_rdma_wait_comp_channel(RDMAContext 
>> *rdma)
>>       return rdma->error_state;
>>   }
>>   
>> +static struct ibv_comp_channel *to_channel(RDMAContext *rdma, int wrid)
>> +{
>> +    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_comp_channel :
>> +           rdma->comp_channel;
>> +}
>> +
>> +static struct ibv_cq *to_cq(RDMAContext *rdma, int wrid)
>> +{
>> +    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_cq : rdma->cq;
>> +}
>> +
>>   /*
>>    * Block until the next work request has completed.
>>    *
>> @@ -1599,13 +1634,15 @@ static int qemu_rdma_block_for_wrid(RDMAContext 
>> *rdma, int wrid_requested,
>>       struct ibv_cq *cq;
>>       void *cq_ctx;
>>       uint64_t wr_id = RDMA_WRID_NONE, wr_id_in;
>> +    struct ibv_comp_channel *ch = to_channel(rdma, wrid_requested);
>> +    struct ibv_cq *poll_cq = to_cq(rdma, wrid_requested);
>>   
>> -    if (ibv_req_notify_cq(rdma->cq, 0)) {
>> +    if (ibv_req_notify_cq(poll_cq, 0)) {
>>           return -1;
>>       }
>>       /* poll cq first */
>>       while (wr_id != wrid_requested) {
>> -        ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
>> +        ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
>>           if (ret < 0) {
>>               return ret;
>>           }
>> @@ -1626,12 +1663,12 @@ static int qemu_rdma_block_for_wrid(RDMAContext 
>> *rdma, int wrid_requested,
>>       }
>>   
>>       while (1) {
>> -        ret = qemu_rdma_wait_comp_channel(rdma);
>> +        ret = qemu_rdma_wait_comp_channel(rdma, ch);
>>           if (ret) {
>>               goto err_block_for_wrid;
>>           }
>>   
>> -        ret = ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx);
>> +        ret = ibv_get_cq_event(ch, &cq, &cq_ctx);
>>           if (ret) {
>>               perror("ibv_get_cq_event");
>>               goto err_block_for_wrid;
>> @@ -1645,7 +1682,7 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, 
>> int wrid_requested,
>>           }
>>   
>>           while (wr_id != wrid_requested) {
>> -            ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
>> +            ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
>>               if (ret < 0) {
>>                   goto err_block_for_wrid;
>>               }
>> @@ -2365,10 +2402,18 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>>           ibv_destroy_cq(rdma->cq);
>>           rdma->cq = NULL;
>>       }
>> +    if (rdma->send_cq) {
>> +        ibv_destroy_cq(rdma->send_cq);
>> +        rdma->send_cq = NULL;
>> +    }
>>       if (rdma->comp_channel) {
>>           ibv_destroy_comp_channel(rdma->comp_channel);
>>           rdma->comp_channel = NULL;
>>       }
>> +    if (rdma->send_comp_channel) {
>> +        ibv_destroy_comp_channel(rdma->send_comp_channel);
>> +        rdma->send_comp_channel = NULL;
>> +    }
>>       if (rdma->pd) {
>>           ibv_dealloc_pd(rdma->pd);
>>           rdma->pd = NULL;
>> @@ -3041,9 +3086,13 @@ static void 
>> qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>>       if (io_read) {
>>           aio_set_fd_handler(ctx, rioc->rdmain->comp_channel->fd,
>>                              false, io_read, io_write, NULL, opaque);
>> +        aio_set_fd_handler(ctx, rioc->rdmain->send_comp_channel->fd,
>> +                           false, io_read, io_write, NULL, opaque);
>>       } else {
>>           aio_set_fd_handler(ctx, rioc->rdmaout->comp_channel->fd,
>>                              false, io_read, io_write, NULL, opaque);
>> +        aio_set_fd_handler(ctx, rioc->rdmaout->send_comp_channel->fd,
>> +                           false, io_read, io_write, NULL, opaque);
>>       }
>>   }
>>   
>> @@ -3256,7 +3305,22 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void 
>> *opaque,
>>        */
>>       while (1) {
>>           uint64_t wr_id, wr_id_in;
>> -        int ret = qemu_rdma_poll(rdma, &wr_id_in, NULL);
>> +        int ret = qemu_rdma_poll(rdma, rdma->cq, &wr_id_in, NULL);
>> +        if (ret < 0) {
>> +            error_report("rdma migration: polling error! %d", ret);
>> +            goto err;
>> +        }
>> +
>> +        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
>> +
>> +        if (wr_id == RDMA_WRID_NONE) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    while (1) {
>> +        uint64_t wr_id, wr_id_in;
>> +        int ret = qemu_rdma_poll(rdma, rdma->send_cq, &wr_id_in, NULL);
>>           if (ret < 0) {
>>               error_report("rdma migration: polling error! %d", ret);
>>               goto err;
>> -- 
>> 2.31.1
>>
>>
>>
