Re: [Qemu-devel] An RDMA race?

qemu-devel

From:	Michael R. Hines
Subject:	Re: [Qemu-devel] An RDMA race?
Date:	Sat, 9 Jan 2016 05:03:05 -0600
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0

I don't mind ACKing this change if we could agree on some kind of regression test for this. (I have an RDMA card at home that I could run tests on if need be.)

The way that virt-test goes about this is not sufficient. The way I do testing for RDMA is that I not only confirm that the migration succeeded or failed, but I also compare serial console output for funny keywords, like "panic" and so forth, as a poor man's attempt to guess whether or not there was any memory corruption.
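
As a rough illustration of the keyword check described above, here is a minimal sketch. The keyword list and the function names (scan_console, migration_looks_clean) are hypothetical and are not taken from virt-test or any existing harness; a real test would tune the keywords to the guest OS.

```python
import re

# Hypothetical keyword list; extend it for the guest OS under test.
SUSPICIOUS = ("panic", "oops", "BUG:", "segfault", "Call Trace")


def scan_console(text, keywords=SUSPICIOUS):
    """Return (line_number, line) pairs that contain any keyword.

    Matching is case-insensitive and line numbers start at 1.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords),
                         re.IGNORECASE)
    return [(n, line)
            for n, line in enumerate(text.splitlines(), start=1)
            if pattern.search(line)]


def migration_looks_clean(migration_ok, console_text):
    """Pass only if migration reported success AND the console is quiet."""
    return migration_ok and not scan_console(console_text)
```

A clean console is only weak evidence that memory was not corrupted, so this would complement rather than replace the success/failure check.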

Do you have a testing harness for yourself? (I'd also like to know what the COLO guys are doing).


Maybe we can coalesce around something?

- Michael

On 01/04/2016 12:15 PM, Dr. David Alan Gilbert wrote:

* Michael R. Hines (address@hidden) wrote:

Adding such a control message would defeat the benefits of RDMA, as there
shouldn't be any signalling in the actual DMA path, or RDMA latency would
be too high. If you're sending control messages for individual writes, then
you need to change up your design. It's OK to design ACKs for groups of
writes, depending on the requirements.

I started off with sending individual messages, and then once I had it working
I made it group them to send one message every 2048 pages.
The performance isn't very good though, and I've not yet analysed why.
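
The grouping described above (one message per 2048 pages instead of one per page) could be sketched roughly as follows. This is only an illustration of the batching idea, not the code from the patch under discussion; WriteNotifier and the send callback are hypothetical names.

```python
BATCH_PAGES = 2048  # group size mentioned in the thread


class WriteNotifier:
    """Collect written page numbers and emit one message per batch."""

    def __init__(self, send, batch=BATCH_PAGES):
        self.send = send      # hypothetical callback: transmits one control message
        self.batch = batch
        self.pending = []

    def page_written(self, page):
        self.pending.append(page)
        if len(self.pending) >= self.batch:
            self.flush()

    def flush(self):
        """Send whatever is pending; call this at the end of a round."""
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
```

With this shape, 5000 written pages followed by a final flush() produce three control messages rather than 5000.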

So, the out-of-order issue you're seeing is only with your new message, not
the original messages?

Yes I believe they're only on the new messages; however:
   1) I'm sending a lot more control messages, so if there's a race I'm
     a lot more likely to trigger it. (I'm not sure I'm triggering it in the
     case where I group those 2048 together) - so does this mean it would
     occasionally trigger on the unmodified code?

   2) My reading of the existing code is that I think it could happen;
     a) the source is ready to send something and is waiting for a CONTROL_READY,
     b) the destination sends the CONTROL_READY
         (blocking in qemu_rdma_post_send_control call to
          qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL))
     c) The source sends its data
     d) That arrives at the destination
     e) finally the WRID_SEND_CONTROL arrives back

    It's having d/e the wrong way round which is the race I think I'm seeing
    and then we lose (d)'s data.
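
To make the d/e ordering concrete, here is a toy model in plain Python. It does not use the verbs API and is not QEMU's implementation; ControlChannel, the event names and the stash are all hypothetical. It shows one way a waiter could tolerate either ordering: a completion that arrives while waiting for a different one is stashed rather than discarded.

```python
from collections import deque

SEND_CONTROL = "SEND_CONTROL"   # local signal that our send completed
RECV_CONTROL = "RECV_CONTROL"   # a control message arrived from the peer


class ControlChannel:
    def __init__(self, completions):
        self.cq = deque(completions)   # stand-in for the completion queue
        self.stashed = deque()         # completions seen while waiting for another kind

    def wait_for(self, wanted):
        """Return the payload of the next `wanted` completion."""
        # Check completions that arrived early during a previous wait.
        for i, (kind, payload) in enumerate(self.stashed):
            if kind == wanted:
                del self.stashed[i]
                return payload
        # Otherwise drain the queue, keeping anything we are not waiting for.
        while self.cq:
            kind, payload = self.cq.popleft()
            if kind == wanted:
                return payload
            self.stashed.append((kind, payload))
        raise LookupError("no %s completion available" % wanted)


# Expected order (e before d) and the raced order (d before e):
expected = ControlChannel([(SEND_CONTROL, None), (RECV_CONTROL, "data")])
raced = ControlChannel([(RECV_CONTROL, "data"), (SEND_CONTROL, None)])
for ch in (expected, raced):
    ch.wait_for(SEND_CONTROL)
    print(ch.wait_for(RECV_CONTROL))
```

In both orderings the second wait still returns the payload; a waiter that simply dropped the unexpected completion would lose it in the raced case.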

Can you describe/document it in more detail so I can help advise?

There are 2 cases where the destination needs to know which pages it's received:
   i) In COLO or checkpointing where it's receiving a partial new checkpoint;
     since it's only receiving a partial checkpoint it needs to know what it's
     received. This allows the destination to avoid copying the whole of its
     received checkpoint and only copy the bits that changed.

  ii) On postcopy once a page is received by the destination the page has to
     be atomically placed;  I've not thought too hard about that yet.
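
For case (i), a minimal sketch of what "knowing what it's received" might look like: a per-page bitmap that is set as notifications arrive and then used to copy only those pages out of the staging area. ReceivedPages and apply_checkpoint are hypothetical names, and pages are modelled as list elements purely for illustration.

```python
class ReceivedPages:
    """Bitmap with one bit per page in the partial checkpoint."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.bits = bytearray((num_pages + 7) // 8)

    def mark(self, page):
        self.bits[page >> 3] |= 1 << (page & 7)

    def received(self):
        return [p for p in range(self.num_pages)
                if self.bits[p >> 3] & (1 << (p & 7))]

    def clear(self):
        self.bits = bytearray(len(self.bits))


def apply_checkpoint(staging, live, tracker):
    """Copy only the pages marked as received, then reset the tracker."""
    for p in tracker.received():
        live[p] = staging[p]
    tracker.clear()
```

Pages that were never marked are left untouched in the live copy, which is the saving described above. The atomic placement needed for case (ii) is not addressed by this sketch.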

Dave

- Michael

On Mon, Dec 14, 2015 at 6:53 PM, Dr. David Alan Gilbert <address@hidden> wrote:
* Michael R. Hines (address@hidden) wrote:

David,

Thanks for including my email directly. It helps a lot.

Below, I'm going to assume that only "dest" is calling
qemu_rdma_exchange_recv()
and only src is calling qemu_rdma_exchange_send(), since you didn't
specify who is sending and who is receiving.

If that assumption is wrong, please respond again.

That's correct.

Comments inline.....

On Sat, Dec 12, 2015 at 1:48 AM, Dr. David Alan Gilbert <address@hidden> wrote:
Hi Michael,
    I think I've got an RDMA race condition, but I'm being a little
cautious at the moment and wondered if you agree with the following
diagnosis.

It's showing up in a world of mine that's sending more control messages
from the destination->source and I'm seeing the following.

We normally expect:

    src                        dest
      ----------->control ready->

If src is sending, this is not correct. Dest should send the ready
message if it is receiving, not src, which breaks the above assumption.
So, I'll reverse the previous assumption and continue with your
observation, assuming that src is receiving instead of dest, which should
instead look like:

Gah! Yes, I got the label the wrong way around; it's dest sending control
ready.

src  (receiving)                      dest (sending)
      ----------->control ready->

    Sees SEND_CONTROL signal to ack that it has been sent

I'll assume here that you meant that dest sees the ready message and
then later sends something.

          <-----control message--
    Sees RECV_CONTROL message from dest

Similar assumption for the receiver (src).

but what I'm seeing is:
    src                        dest
      ----------->control ready->
          <-----control message--
    Sees RECV_CONTROL message from dest

hmmmmm....

    Sees SEND_CONTROL signal to ack that it has been sent

There's not enough information here....... do you have a multi-threaded
send or receive or something?

No, I've been trying to wire RDMA into the COLO fault-tolerant setup;
so the change which got me to trigger this bug was that I'd
added a new control message 'notify write' which explicitly
told the destination it had a page written to; at the RDMA level
that was the only change.

Do the work request IDs match up?

Yes I think so; I also added a sequence number to the 'ready' messages
to check I wasn't losing one.
I had a chat to one of our RDMA guys (Doug Ledford) and he said
it's perfectly legal for RDMA to take longer to return the signal
from the send than for the round trip of the destination responding;
the 'signal' doesn't happen until an ack has been received from the
destination card anyway, so the ack can get delayed or retried.
So I think we do need to fix this; the question then is how do we fix
it for all control messages without breaking anything else.   Are there
any cases that rely on having received the signal from the send before
continuing, or could I just do what I'm doing for all control messages?
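
The sequence-number check mentioned above could look something like the following. This is a guess at the idea, not the actual debugging patch; ReadySequenceChecker is a hypothetical name, and numbering from zero is an assumption.

```python
class ReadySequenceChecker:
    """Detect lost or reordered 'ready' messages from their sequence numbers."""

    def __init__(self):
        self.expected = 0   # assumption: the sender numbers messages from 0

    def observe(self, seq):
        """Record one received 'ready'; return the sequence numbers skipped."""
        if seq < self.expected:
            raise ValueError("ready %d arrived late or twice" % seq)
        missing = list(range(self.expected, seq))
        self.expected = seq + 1
        return missing
```

An empty list from every observe() call means no 'ready' message went missing, which matches the observation that the messages themselves were not being lost.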

Dave

- Michael

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK



--
/*
  * Michael R. Hines
  * https://michael.hinespot.com
  */

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


--
/*
 * Michael R. Hines
 * Platform Engineer, DigitalOcean.
 */
