From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
Date: Thu, 11 Apr 2013 16:48:20 +0300

On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
> On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> >On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> >Maybe we should just say "RDMA is incompatible with memory
> >overcommit" and be done with it then. But see below.
> >>I would like to propose a compromise:
> >>
> >>How about we *keep* the registration capability and leave it enabled
> >>by default?
> >>
> >>This gives management tools the ability to get performance if they want to,
> >>but also satisfies your requirements in case management doesn't know the
> >>feature exists - they will just get the default enabled?
> >Well unfortunately the "overcommit" feature as implemented seems useless
> >really.  Someone wants to migrate with RDMA but with low performance?
> >Why not migrate with TCP then?
> Answer below.
> >>Either way, I agree that the optimization would be very useful,
> >>but I disagree that it is possible for an optimized registration algorithm
> >>to perform *as well as* the case when there is no dynamic
> >>registration at all.
> >>
> >>The point is that dynamic registration *only* helps overcommitment.
> >>
> >>It does nothing for performance - and since that's true any optimizations
> >>that improve on dynamic registrations will always be sub-optimal to turning
> >>off dynamic registration in the first place.
> >>
> >>- Michael
> >So you've given up on it.  Question is, sub-optimal by how much?  And
> >where's the bottleneck?
> >
> >Let's do some math. Assume you send a 16-byte registration request and
> >get back a 16-byte response for each 4 KB page (is 16 bytes enough?).  That's
> >32/4096 < 1% transport overhead. Negligible.
> >
> >Is it the source CPU then? But the CPU on the source is basically doing
> >the same things as with pre-registration: you do not pin all memory on
> >the source.
> >
> >So it must be the destination CPU that does not keep up then?
> >But it has to do even less than the source CPU.
> >
> >I suggest one explanation: the protocol you proposed is inefficient.
> >It seems to basically do everything in a single thread:
> >get a chunk, pin, wait for control credit, request, response, rdma, unpin.
> >There are two round-trips of send/receive here where you are not
> >doing anything useful. Why not let migration proceed?
> >
> >Doesn't all of this sound worth checking before we give up?
> >
> First, let me remind you:
> Chunks are already doing this!
> Perhaps you don't fully understand how chunks work or perhaps I
> should be more verbose in the documentation. The protocol is already
> joining multiple pages into a single chunk without issuing any writes.
> It is only once the chunk is full that an actual page registration
> request occurs.

I think I got that at a high level.
But there is a stall between chunks. If you make chunks smaller,
but pipeline registration, then there will never be any stall.
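
To illustrate the stall argument, here is a minimal timing sketch in
Python. It is a toy model, not the protocol from the patch: the function
names, the two-stage pipeline and the per-chunk costs are all assumptions
chosen for illustration, not measurements.

```python
def serial_time(n_chunks, reg_s, xfer_s):
    """Total time when each chunk is registered, then written, one after another.

    reg_s  -- assumed cost of one registration round trip, in seconds
    xfer_s -- assumed cost of one RDMA write of a chunk, in seconds
    """
    return n_chunks * (reg_s + xfer_s)


def pipelined_time(n_chunks, reg_s, xfer_s):
    """Total time when chunk i+1 is registered while chunk i is being written.

    Modelled as a two-stage pipeline: only the first registration and the
    last write are not overlapped; in between, the slower stage sets the pace.
    """
    if n_chunks == 0:
        return 0.0
    return reg_s + (n_chunks - 1) * max(reg_s, xfer_s) + xfer_s


# Hypothetical costs: 1 time unit per registration, 2 per chunk write.
serial = serial_time(4, 1.0, 2.0)
pipelined = pipelined_time(4, 1.0, 2.0)
print(serial, pipelined)
```

In this model the pipelined sender waits only for the first registration,
as long as a registration round trip takes no longer than writing one
chunk. If registration is the slower stage, it becomes the bottleneck
instead.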

> So, basically what you want to know is what happens if we *change*
> the chunk size dynamically?

What I wanted to know is where the performance is going.
Why is the chunk-based approach slower? It's not the extra messages
on the wire; these take up negligible bandwidth.
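
The bandwidth estimate can be checked with a few lines of Python. This is
only a sketch of the arithmetic: the 16-byte request and response sizes
are the guess made earlier in this thread, and the per-chunk case assumes
one request/response pair per 1 MB chunk, which may not match the actual
wire format.

```python
def control_overhead(request_bytes, response_bytes, payload_bytes):
    """Fraction of extra control traffic relative to the payload it describes."""
    return (request_bytes + response_bytes) / payload_bytes


# Worst case from the thread: one request/response pair per 4 KB page.
per_page = control_overhead(16, 16, 4096)

# Assumed case: one request/response pair per 1 MB chunk.
per_chunk = control_overhead(16, 16, 1024 * 1024)

print(f"per page:  {per_page:.2%}")
print(f"per chunk: {per_chunk:.5%}")
```

Even in the per-page case the control messages stay under 1% of the
payload, so the slowdown has to come from somewhere other than wire
bandwidth.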

> Something like this:
> 1. Chunk = 1MB, what is the performance?
> 2. Chunk = 2MB, what is the performance?
> 3. Chunk = 4MB, what is the performance?
> 4. Chunk = 8MB, what is the performance?
> 5. Chunk = 16MB, what is the performance?
> 6. Chunk = 32MB, what is the performance?
> 7. Chunk = 64MB, what is the performance?
> 8. Chunk = 128MB, what is the performance?
> I'll get you this table today. Expect an email soon.
> - Michael
