
From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
Date: Thu, 11 Apr 2013 17:37:18 +0300

On Thu, Apr 11, 2013 at 09:58:50AM -0400, Michael R. Hines wrote:
> On 04/11/2013 09:48 AM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
> >>On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> >>>On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> >>>Maybe we should just say "RDMA is incompatible with memory
> >>>overcommit" and be done with it then. But see below.
> >>>>I would like to propose a compromise:
> >>>>
> >>>>How about we *keep* the registration capability and leave it enabled
> >>>>by default?
> >>>>
> >>>>This gives management tools the ability to get performance if they want to,
> >>>>but also satisfies your requirements in case management doesn't know the
> >>>>feature exists - they will just get the default enabled?
> >>>Well unfortunately the "overcommit" feature as implemented seems useless
> >>>really.  Someone wants to migrate with RDMA but with low performance?
> >>>Why not migrate with TCP then?
> >>Answer below.
> >>
> >>>>Either way, I agree that the optimization would be very useful,
> >>>>but I disagree that it is possible for an optimized registration algorithm
> >>>>to perform *as well as* the case when there is no dynamic
> >>>>registration at all.
> >>>>
> >>>>The point is that dynamic registration *only* helps overcommitment.
> >>>>
> >>>>It does nothing for performance - and since that's true any optimizations
> >>>>that improve on dynamic registrations will always be sub-optimal to turning
> >>>>off dynamic registration in the first place.
> >>>>
> >>>>- Michael
> >>>So you've given up on it.  Question is, sub-optimal by how much?  And
> >>>where's the bottleneck?
> >>>
> >>>Let's do some math. Assume you send a 16-byte registration request and
> >>>get back a 16-byte response for each 4 Kbyte page (16 bytes enough?).
> >>>That's 32/4096 < 1% transport overhead. Negligible.
> >>>
> >>>Is it the source CPU then? But CPU on source is basically doing same
> >>>things as with pre-registration: you do not pin all memory on source.
> >>>
> >>>So it must be the destination CPU that does not keep up then?
> >>>But it has to do even less than the source CPU.
> >>>
> >>>I suggest one explanation: the protocol you proposed is inefficient.
> >>>It seems to basically do everything in a single thread:
> >>>get a chunk, pin, wait for control credit, request, response, rdma, unpin.
> >>>There are two round-trips of send/receive here where you are not
> >>>doing anything useful. Why not let migration proceed?
> >>>
> >>>Doesn't all of this sound worth checking before we give up?
> >>>
> >>First, let me remind you:
> >>
> >>Chunks are already doing this!
> >>
> >>Perhaps you don't fully understand how chunks work or perhaps I
> >>should be more verbose in the documentation. The protocol is already
> >>joining multiple pages into a single chunk without issuing any writes.
> >>It is not until the chunk is full that an actual page registration
> >>request occurs.
> >I think I got that at a high level.
> >But there is a stall between chunks. If you make chunks smaller,
> >but pipeline registration, then there will never be any stall.
> 
> Pipelining == chunking.

pipelining:
https://en.wikipedia.org/wiki/Pipeline_%28computing%29
chunking:
https://en.wikipedia.org/wiki/Chunking_%28computing%29

> You cannot eliminate the stall,
> that's impossible.

Sure, you can eliminate the stalls. Just hide them
behind data transfers. See a diagram below.


> You can *grow* the chunk size (i.e. the pipeline)
> to amortize the cost of the stall, but you cannot eliminate
> the stall at the end of the pipeline.
> 
> At some point you have to flush the pipeline (i.e. the chunk),
> whether you like it or not.

You can process many chunks in parallel. Make chunks smaller but process
them in a pipelined fashion.  Yes, the pipe might stall, but it won't if
the receive side is as fast as the send side, and then you won't have to
flush at all.


> >>So, basically what you want to know is what happens if we *change*
> >>the chunk size
> >>dynamically?
> >What I wanted to know is where the performance is going.
> >Why is chunk based slower? It's not the extra messages
> >on the wire; these take up negligible BW.
> 
> Answer above.
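
To put a number on the wire overhead of registration messages, here is a
minimal sketch. It assumes one request and one response per registration
unit and uses the 16-byte message sizes from the estimate quoted earlier;
the function name and the 1 MiB unit size are illustrative, not taken from
the patch.

```python
def registration_overhead(req_bytes: int, resp_bytes: int, unit_bytes: int) -> float:
    """Fraction of extra bytes on the wire per registration unit.

    Assumes exactly one request and one response per unit (an assumption
    for illustration; the real protocol may batch differently).
    """
    if unit_bytes <= 0:
        raise ValueError("unit_bytes must be positive")
    return (req_bytes + resp_bytes) / unit_bytes


# 16-byte request + 16-byte response per 4 KiB page
per_page = registration_overhead(16, 16, 4096)

# Same messages amortized over a hypothetical 1 MiB chunk
per_chunk = registration_overhead(16, 16, 1024 * 1024)
```

Under these assumptions the per-page overhead is under 1%, and it shrinks
further as the registration unit grows, so message size alone does not
explain a large slowdown.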


Here's how things are supposed to work in a pipeline:

req -> registration request
res -> response
done -> rdma done notification (remote can unregister)
pgX  -> page, or chunk, or whatever unit is used
        for registration
rdma -> one or more rdma write requests



pg1 -> pin -> req -> res -> rdma -> done
       pg2 -> pin -> req -> res -> rdma -> done
              pg3 -> pin -> req -> res -> rdma -> done
                     pg4 -> pin -> req -> res -> rdma -> done
                            pg5 -> pin -> req -> res -> rdma -> done



It's like an assembly line, see?  So while software does the registration
roundtrip dance, hardware is processing rdma requests for previous
chunks.
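
A rough timing model of the assembly line, as a sketch only. It assumes
every unit costs a fixed t_reg for the pin + req + res step in software and
a fixed t_rdma for the write in hardware, and that the two stages can
overlap. Both function names and the constant costs are assumptions for
illustration, not measurements of the proposed protocol.

```python
def serial_time(n_units: int, t_reg: float, t_rdma: float) -> float:
    """Single-threaded flow: register, then write, one unit at a time."""
    return n_units * (t_reg + t_rdma)


def pipelined_time(n_units: int, t_reg: float, t_rdma: float) -> float:
    """Two-stage pipeline: software registers unit i+1 while hardware
    is still writing unit i."""
    reg_done = 0.0   # time at which the latest registration completed
    rdma_done = 0.0  # time at which the latest rdma write completed
    for _ in range(n_units):
        # Registrations run back to back in software.
        reg_done += t_reg
        # A write starts once its registration is done and the hardware is free.
        rdma_done = max(rdma_done, reg_done) + t_rdma
    return rdma_done
```

With 4 units, t_reg = 1 and t_rdma = 2, the serial flow takes 12 time units
and the pipelined flow takes 9: after the first registration, the remaining
ones are hidden behind data transfers.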

....

When do you have to stall? When you run out of rx buffer credits, so you
cannot start a new req.  Your protocol has 2 outstanding buffers,
so you can only have one req in the air. Post more buffers and
you will not need to stall - possibly at all.
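
The effect of the credit limit can be sketched with a small model. It
assumes a fixed registration round-trip time rtt, a fixed write time t_rdma
per unit, writes serialized on the hardware, and a credit that is returned
as soon as the response arrives. The function and parameter names are
hypothetical.

```python
def total_time(n_units: int, rtt: float, t_rdma: float, max_in_flight: int) -> float:
    """Completion time when at most max_in_flight reqs may be outstanding."""
    if max_in_flight < 1:
        raise ValueError("need at least one credit")
    responses = []   # response arrival time for each unit
    rdma_done = 0.0  # time at which the latest rdma write completed
    for i in range(n_units):
        # A new req needs a free credit: wait for the response that
        # frees the slot used max_in_flight requests ago.
        issue = 0.0 if i < max_in_flight else responses[i - max_in_flight]
        resp = issue + rtt
        responses.append(resp)
        # The write starts when the response is in and the hardware is free.
        rdma_done = max(rdma_done, resp) + t_rdma
    return rdma_done
```

With 4 units, rtt = 4 and t_rdma = 1, one req in the air gives a total of 17,
two give 10, and four give 8. In this model, once enough reqs are in flight
to cover the round trip, the only remaining stall is the first one.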

One other minor point is that your protocol requires extra explicit
ready commands. You can pass the number of rx buffers as extra payload
in the traffic you are sending anyway, and reduce that overhead.
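
A sketch of what piggybacking could look like: a control header that always
carries the sender's count of free rx buffers, so no separate ready message
is needed. The field layout (type, length, credits) is made up for
illustration and is not the wire format of the patch series.

```python
import struct

# Hypothetical header: 32-bit type, 32-bit payload length,
# 16-bit count of rx buffers the sender has posted (network byte order).
HEADER = struct.Struct("!IIH")


def pack_message(msg_type: int, payload: bytes, rx_credits: int) -> bytes:
    """Prefix the payload with a header that also advertises rx credits."""
    return HEADER.pack(msg_type, len(payload), rx_credits) + payload


def unpack_message(data: bytes):
    """Return (msg_type, payload, rx_credits) from a packed message."""
    msg_type, length, rx_credits = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return msg_type, payload, rx_credits
```

The receiver updates its view of the peer's credits on every message it
parses, which costs two bytes per message instead of an extra round of
explicit ready commands.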

-- 
MST


