From: Michael R. Hines
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
Date: Thu, 11 Apr 2013 09:58:50 -0400
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130106 Thunderbird/17.0.2

On 04/11/2013 09:48 AM, Michael S. Tsirkin wrote:
On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
Maybe we should just say "RDMA is incompatible with memory
overcommit" and be done with it then. But see below.
I would like to propose a compromise:

How about we *keep* the registration capability and leave it enabled
by default?

This gives management tools the ability to get performance if they want it,
but it also satisfies your requirements in case management doesn't know the
feature exists - they will just get the default.
Well unfortunately the "overcommit" feature as implemented seems useless
really.  Someone wants to migrate with RDMA but with low performance?
Why not migrate with TCP then?
Answer below.

Either way, I agree that the optimization would be very useful,
but I disagree that it is possible for an optimized registration algorithm
to perform *as well as* the case when there is no dynamic
registration at all.

The point is that dynamic registration *only* helps overcommitment.

It does nothing for performance - and since that's true, any optimization
that improves on dynamic registration will always be sub-optimal compared to
turning off dynamic registration in the first place.

- Michael
So you've given up on it.  Question is, sub-optimal by how much?  And
where's the bottleneck?

Let's do some math. Assume you send a 16-byte registration request and
get back a 16-byte response for each 4-Kbyte page (is 16 bytes enough?). That's
32/4096 < 1% transport overhead. Negligible.
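
A minimal sketch of that arithmetic, assuming the 16-byte request/response
sizes above (illustrative sizes, not actual wire formats):

#include <stdio.h>

/* Back-of-the-envelope check of the estimate above: per-page cost of a
 * 16-byte registration request plus a 16-byte response for a 4 KB page. */
int main(void)
{
    const double request_bytes  = 16.0;
    const double response_bytes = 16.0;
    const double page_bytes     = 4096.0;

    double overhead = (request_bytes + response_bytes) / page_bytes;
    printf("per-page transport overhead: %.2f%%\n", overhead * 100.0); /* ~0.78% */
    return 0;
}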

Is it the source CPU then? But the CPU on the source is basically doing the
same things as with pre-registration: you do not pin all memory on the source.

So it must be the destination CPU that does not keep up then?
But it has to do even less than the source CPU.

I suggest one explanation: the protocol you proposed is inefficient.
It seems to basically do everything in a single thread:
get a chunk, pin, wait for control credit, request, response, rdma, unpin.
There are two round-trips of send/receive here where you are not
doing anything useful. Why not let migration proceed?
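
As a rough illustration of the overlap idea (the round-trip and transfer
times below are made-up assumptions, not measurements), compare stalling on
every chunk's registration with keeping the next chunk's registration in
flight behind the current chunk's write:

#include <stdio.h>

int main(void)
{
    const double rtt_us      = 50.0;    /* assumed registration round-trip */
    const double transfer_us = 200.0;   /* assumed RDMA write time per chunk */
    const int    nchunks     = 1000;

    /* Serialized: every chunk waits out the full round-trip before its write. */
    double serialized_us = nchunks * (rtt_us + transfer_us);

    /* Pipelined: chunk N+1's registration is in flight while chunk N is being
     * written, so only the first round-trip is exposed (valid while
     * rtt_us <= transfer_us). */
    double pipelined_us = rtt_us + nchunks * transfer_us;

    printf("serialized: %.0f us, pipelined: %.0f us (%.1f%% saved)\n",
           serialized_us, pipelined_us,
           (serialized_us - pipelined_us) / serialized_us * 100.0);
    return 0;
}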

Doesn't all of this sound worth checking before we give up?

First, let me remind you:

Chunks are already doing this!

Perhaps you don't fully understand how chunks work, or perhaps I should be
more verbose in the documentation. The protocol is already joining multiple
pages into a single chunk without issuing any writes. It is not until the
chunk is full that an actual page registration request occurs.
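
A minimal sketch of that chunk accumulation, with hypothetical types and
names rather than the actual patch code: pages are only queued into the
current chunk, and nothing goes on the wire until the chunk fills up.

#include <stdio.h>

#define PAGES_PER_CHUNK 256          /* assumed chunk size: 1 MB of 4 KB pages */
#define PAGE_SIZE       4096UL

typedef struct {
    unsigned long start_addr;
    int npages;
} Chunk;

/* Stand-in for: send the registration request, wait for the remote key
 * in the response, then issue the RDMA write covering the whole chunk. */
static void register_and_write_chunk(const Chunk *c)
{
    printf("register + write chunk at 0x%lx (%d pages)\n",
           c->start_addr, c->npages);
}

static void queue_page(Chunk *c, unsigned long addr)
{
    if (c->npages == 0) {
        c->start_addr = addr;
    }
    c->npages++;
    if (c->npages == PAGES_PER_CHUNK) {
        register_and_write_chunk(c);   /* only now does a request go out */
        c->npages = 0;
    }
}

int main(void)
{
    Chunk current = { 0, 0 };

    for (unsigned long page = 0; page < 1024; page++) {
        queue_page(&current, page * PAGE_SIZE);
    }
    return 0;
}
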
I think I got that at a high level.
But there is a stall between chunks. If you make chunks smaller,
but pipeline registration, then there will never be any stall.

Pipelining == chunking. You cannot eliminate the stall,
that's impossible.

You can *grow* the chunk size (i.e. the pipeline)
to amortize the cost of the stall, but you cannot eliminate
the stall at the end of the pipeline.

At some point you have to flush the pipeline (i.e. the chunk),
whether you like it or not.
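
The same point in numbers, using made-up round-trip and bandwidth figures:
growing the chunk shrinks the stall's share of each chunk's time, but the
stall term never reaches zero.

#include <stdio.h>

int main(void)
{
    const double rtt_us          = 50.0;    /* assumed registration round-trip */
    const double bw_bytes_per_us = 5000.0;  /* assumed ~40 Gbit/s link */
    const double page_bytes      = 4096.0;

    for (int pages = 64; pages <= 4096; pages *= 4) {
        double transfer_us = pages * page_bytes / bw_bytes_per_us;
        double chunk_us    = transfer_us + rtt_us;   /* one stall per chunk */
        printf("%5d pages/chunk: stall is %4.1f%% of chunk time\n",
               pages, rtt_us / chunk_us * 100.0);
    }
    return 0;
}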


So, basically, what you want to know is what happens if we *change*
the chunk size dynamically?
What I wanted to know is where the performance is going.
Why is chunk-based slower? It's not the extra messages
on the wire; these take up negligible BW.

Answer above.



