From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
Date: Wed, 10 Apr 2013 20:41:07 +0300

On Wed, Apr 10, 2013 at 11:29:24AM -0400, Michael R. Hines wrote:
> On 04/10/2013 09:34 AM, Michael S. Tsirkin wrote:
> >On Wed, Apr 10, 2013 at 09:04:44AM -0400, Michael R. Hines wrote:
> >>On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
> >>>Below is a great high level overview. the protocol looks correct.
> >>>A bit more detail would be helpful, as noted below.
> >>>
> >>>The main thing I'd like to see changed is that there are already
> >>>two protocols here: chunk-based and non chunk based.
> >>>We'll need to use versioning and capabilities going forward but in the
> >>>first version we don't need to maintain compatibility with legacy so
> >>>two versions seems like unnecessary pain.  Chunk based is somewhat slower
> >>>and that is worth fixing longer term, but seems like the way forward. So
> >>>let's implement a single chunk-based protocol in the first version we
> >>>merge.
> >>>
> >>>Some more minor improvement suggestions below.
> >>Thanks.
> >>
> >>However, IMHO restricting the policy to only use chunk-based is really
> >>not an acceptable choice:
> >>
> >>Here's the reason: Using my 10gbps RDMA hardware, throughput takes a
> >>dive from 10gbps to 6gbps.
> >Who cares about the throughput really? What we do care about
> >is how long the whole process takes.
> >
> Low latency and high throughput is very important =)
> Without these properties of RDMA, many workloads simply either
> take too long to finish migrating or do not converge to a stopping
> point altogether.
> *Not making this a configurable option would defeat the purpose of
> using RDMA altogether.
> Otherwise, you're no better off than just using TCP.

So we have two protocols implemented: one is slow, the other pins all
memory on destination indefinitely.

I see two options here:
- improve the slow version so it's fast, drop the pin all version
- give up and declare RDMA requires pinning all memory on destination

But giving management a way to do RDMA at the speed of TCP? Why is this

> >
> >>But if I disable chunk-based registration altogether (forgoing
> >>overcommit), then performance comes back.
> >>
> >>The reason for this is the additional control channel traffic
> >>needed to ask the server to register
> >>memory pages on demand - without this traffic, we can easily
> >>saturate the link.
> >>But with this traffic, the user needs to know (and be given the
> >>option) to disable the feature
> >>in case they want performance instead of flexibility.
> >>
> >IMO that's just because the current control protocol is so inefficient.
> >You just need to pipeline the registration: request the next chunk
> >while remote side is handling the previous one(s).
> >
> >With any protocol, you still need to:
> >     register all memory
> >     send addresses and keys to source
> >     get notification that write is done
> >what is different with chunk based?
> >simply that there are several network roundtrips
> >before the process can start.
> >So part of the time you are not doing writes,
> >you are waiting for the next control message.
> >
> >So you should be doing several in parallel.
> >This will complicate the protocol though, so I am not asking
> >for this right away.
> >
> >But a broken pin-it-all alternative will just confuse matters.  It is
> >best to keep it out of tree.
> There's a huge difference. (Answer continued below this one).
> The devil is in the details, here: Pipelining is simply not possible
> right now because the migration thread has total control over
> when and which pages are requested to be migrated.
> You can't pipeline page registrations if you don't know the pages
> are dirty -
> and the only way to know that pages are dirty is if the migration thread told
> you to save them.

So it tells you to save them. It does not mean you need to start
RDMA immediately.  Note the address and start the process of
notifying the remote.

> On the other hand, advanced registration of *known* dirty pages
> is very important - I will certainly be submitting a patch in the future
> which attempts to handle this case.

Maybe I'm missing something, and there are changes in the migration core
that are prerequisite to making rdma fast. So take the time and make
these changes, that's better than maintaining a broken protocol.

> >So make the protocol smarter and fix this. This is not something
> >management needs to know about.
> >
> >
> >If you like, you can teach management to specify the max amount of
> >memory pinned. It should be specified at the appropriate place:
> >on the remote for remote, on source for source.
> >
> Answer below.
> >>>
> >>>What is meant by performance here? downtime?
> >>Throughput. Zero page scanning (and dynamic registration) reduces
> >>throughput significantly.
> >Again, not something management should worry about.
> >Do the right thing internally.
> I disagree with that: This is an entirely workload-specific decision,
> not a system-level decision.
> If I have a known memory-intensive workload that is virtualized,
> then it would be "too late" to disable zero page detection *after*
> the RDMA migration begins.
> We have management tools already that are that smart - there's
> nothing wrong with smart management knowing in advance that
> a workload is memory-intensive and also knowing that an RDMA
> migration is going to be issued.

"zero page detection" just cries out "implementation specific".

There's very little chance e.g. a different algorithm will have exactly
same performance tradeoffs. So we change some qemu internals and
suddenly your management carefully tuned for your workload is making all
the wrong decisions.

> There's no way for QEMU to know that in advance without some kind
> of advanced heuristic that tracks the behavior of the VM over time,
> which I don't think anybody wants to get into the business of writing =)

There's even less chance a management tool will make an
intelligent decision here. It's too tied to QEMU internals.

> >>>>+
> >>>>+SEND messages require more coordination because the
> >>>>+receiver must have reserved space (using a receive
> >>>>+work request) on the receive queue (RQ) before QEMUFileRDMA
> >>>>+can start using them to carry all the bytes as
> >>>>+a transport for migration of device state.
> >>>>+
> >>>>+To begin the migration, the initial connection setup is
> >>>>+as follows (migration-rdma.c):
> >>>>+
> >>>>+1. Receiver and Sender are started (command line or libvirt):
> >>>>+2. Both sides post two RQ work requests
> >>>Okay this could be where the problem is. This means with chunk
> >>>based receive side does:
> >>>
> >>>loop:
> >>>   receive request
> >>>   register
> >>>   send response
> >>>
> >>>while with non chunk based it does:
> >>>
> >>>receive request
> >>>send response
> >>>loop:
> >>>   register
> >>No, that's incorrect. With "non" chunk based, the receive side does
> >>*not* communicate
> >>during the migration of pc.ram.
> >It does not matter when this happens. What we care about is downtime and
> >total time from start of qemu on remote and until migration completes.
> >Not peak throughput.
> >If you don't count registration time on remote, that's just wrong.
> Answer above.

I don't see it above.
> >>The control channel is only used for chunk registration and device
> >>state, not RAM.
> >>
> >>I will update the documentation to make that more clear.
> >It's clear enough I think. But it seems you are measuring
> >the wrong things.
> >
> >>>In reality each request/response requires two network round-trips
> >>>with the Ready credit-management messages.
> >>>So the overhead will likely be avoided if we add better pipelining:
> >>>allow multiple registration requests in the air, and add more
> >>>send/receive credits so the overhead of credit management can be
> >>>reduced.
> >>Unfortunately, the migration thread doesn't work that way.
> >>The thread only generates one page write at-a-time.
> >Yes but you do not have to block it. Each page is in these states:
> >     - unpinned not sent
> >     - pinned no rkey
> >     - pinned have rkey
> >     - unpinned sent
> >
> >Each time you get a new page, it's in unpinned not sent state.
> >So you can start it on this state machine, and tell migration thread
> >to proceed to the next page.
> Yes, I'm doing that already (documented as "batching") in the
> docs file.

All I see is a scheme to reduce the number of transmit completions.
This only gives a marginal gain.  E.g. you explicitly say there's a
single command in the air so another registration request can not even
start until you get a registration response.

> But the problem is more complicated than that: there is no coordination
> between the migration_thread and RDMA right now because Paolo is
> trying to maintain a very clean separation of function.
> However we *can* do what you described in a future patch like this:
> 1. Migration thread says "iteration starts, how much memory is dirty?"
> 2. RDMA protocol says "Is there a lot of dirty memory?"
>         OK, yes? Then batch all the registration messages into a single
>         request but do not write the memory until all the registrations
>         have completed.
>         OK, no?  Then just issue registrations with very little batching
>         so that we can quickly move on to the next iteration round.
> Make sense?

Actually, I think you just need to get a page from migration core and
give it to the FSM above.  Then let it give you another page, until you
have N pages in flight in the FSM all at different stages in the
pipeline.  That's the theory.
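
A minimal sketch of that idea, in Python for brevity (QEMU itself is C).
The four states are the ones listed earlier in this thread; everything
else is an assumption made for illustration: the names PageState and
run_pipeline, the one-page-per-tick admission rule (standing in for the
migration thread producing one page at a time), and the simplification
that every in-flight page advances exactly one stage per tick.

```python
from enum import Enum, auto


class PageState(Enum):
    UNPINNED_NOT_SENT = auto()
    PINNED_NO_RKEY = auto()
    PINNED_HAVE_RKEY = auto()
    UNPINNED_SENT = auto()


# One transition per stage. The comments name the (hypothetical) event
# that would trigger each transition in a real implementation.
NEXT_STATE = {
    PageState.UNPINNED_NOT_SENT: PageState.PINNED_NO_RKEY,   # pinned, registration requested
    PageState.PINNED_NO_RKEY: PageState.PINNED_HAVE_RKEY,    # registration result arrived
    PageState.PINNED_HAVE_RKEY: PageState.UNPINNED_SENT,     # RDMA write completed
}


def run_pipeline(pages, max_in_flight):
    """Drive pages through the FSM with at most max_in_flight active.

    Returns (done, ticks, peak): the final state of each page, the number
    of ticks needed, and the largest number of pages in flight at once.
    """
    pending = list(pages)
    in_flight = {}
    done = {}
    ticks = 0
    peak = 0
    while pending or in_flight:
        ticks += 1
        # Take one new page from the migration core if there is room,
        # without waiting for earlier pages to finish.
        if pending and len(in_flight) < max_in_flight:
            in_flight[pending.pop(0)] = PageState.UNPINNED_NOT_SENT
        peak = max(peak, len(in_flight))
        # Advance every in-flight page by one stage.
        for page in list(in_flight):
            state = NEXT_STATE[in_flight[page]]
            if state is PageState.UNPINNED_SENT:
                done[page] = state
                del in_flight[page]
            else:
                in_flight[page] = state
    return done, ticks, peak
```

In this toy model five pages need 15 ticks with a single page in flight
and 7 ticks with three in flight, which is the gain pipelining is
supposed to buy.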

But if you want to try changing migration core, go wild.  Very little
is written in stone here.

> >>If someone were to write a patch which submits multiple
> >>writes at the same time, I would be very interested in
> >>consuming that feature and making chunk registration more
> >>efficient by batching multiple registrations into fewer messages.
> >No changes to migration core are necessary I think.
> >But assuming they are - your protocol design and
> >management API should not be driven by internal qemu APIs.
> Answer above.
> >>>There's no requirement to implement these optimizations upfront
> >>>before merging the first version, but let's remove the
> >>>non-chunkbased crutch unless we see it as absolutely necessary.
> >>>
> >>>>+3. Receiver does listen()
> >>>>+4. Sender does connect()
> >>>>+5. Receiver accept()
> >>>>+6. Check versioning and capabilities (described later)
> >>>>+
> >>>>+At this point, we define a control channel on top of SEND messages
> >>>>+which is described by a formal protocol. Each SEND message has a
> >>>>+header portion and a data portion (but together are transmitted
> >>>>+as a single SEND message).
> >>>>+
> >>>>+Header:
> >>>>+    * Length  (of the data portion)
> >>>>+    * Type    (what command to perform, described below)
> >>>>+    * Version (protocol version validated before send/recv occurs)
> >>>What's the expected value for Version field?
> >>>Also, confusing.  Below mentions using private field in librdmacm instead?
> >>>Need to add # of bytes and endian-ness of each field.
> >>Correct, those are two separate versions. One for capability negotiation
> >>and one for the protocol itself.
> >>
> >>I will update the documentation.
> >Just drop the all-pinned version, and we'll work to improve
> >the chunk-based one until it has reasonable performance.
> >It seems to get a decent speed already: consider that
> >most people run migration with the default speed limit.
> >Supporting all-pinned will just be a pain down the road when
> >we fix performance for chunk based one.
> >
> The speed tops out at 6gbps, that's not good enough for a 40gbps link.
> The migration could complete *much* faster by disabling chunk registration.
> We have very large physical machines, where chunk registration is
> not as important
> as migrating the workload very quickly with very little downtime.
> In these cases, chunk registration just "gets in the way".

Well IMO you give up too early.

It gets in the way because you are not doing data transfers while
you are doing registration. You are doing it by chunks on the
source and source is much busier, it needs to find dirty pages,
and it needs to run VCPUs. Surely remote which is mostly idle should
be able to keep up with the demand.

Just fix the protocol so the control latency is less of the problem.
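
A back-of-the-envelope model of why control latency matters, using only
the numbers quoted in this thread (a 10gbps link, 6gbps observed, 1 MB
chunks). It assumes the whole drop comes from a fixed stall per chunk
during which no data moves, and that with several registrations in
flight those stalls overlap perfectly. Both are simplifying assumptions,
and the function names are made up for illustration.

```python
CHUNK_BITS = 8 * 1024 * 1024  # one 1 MB chunk


def stall_per_chunk(link_bps, observed_bps, chunk_bits=CHUNK_BITS):
    """Seconds per chunk during which the link carries no data."""
    return chunk_bits / observed_bps - chunk_bits / link_bps


def throughput_with_depth(link_bps, stall_s, depth, chunk_bits=CHUNK_BITS):
    """Throughput when `depth` registrations are in flight at once.

    Assumes stalls overlap perfectly, so each chunk pays stall_s / depth.
    """
    per_chunk = chunk_bits / link_bps + stall_s / depth
    return chunk_bits / per_chunk


stall = stall_per_chunk(10e9, 6e9)            # roughly 0.56 ms per chunk
depth4 = throughput_with_depth(10e9, stall, 4)  # roughly 8.6e9 bits/s
```

Under these assumptions about 0.56 ms per chunk is lost to waiting, and
four registrations in flight would already recover most of the gap.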

> >>>>+
> >>>>+The 'type' field has 7 different command values:
> >>>0. Unused.
> >>>
> >>>>+    1. None
> >>>you mean this is unused?
> >>Correct - will update.
> >>
> >>>>+    2. Ready             (control-channel is available)
> >>>>+    3. QEMU File         (for sending non-live device state)
> >>>>+    4. RAM Blocks        (used right after connection setup)
> >>>>+    5. Register request  (dynamic chunk registration)
> >>>>+    6. Register result   ('rkey' to be used by sender)
> >>>Hmm, don't you also need a virtual address for RDMA writes?
> >>>
> >>The virtual addresses are communicated at the beginning of the
> >>migration using command #4 "Ram blocks".
> >Yes but ram blocks are sent source to dest.
> >virtual address needs to be sent dest to source no?
> I just said that, no? =)

You didn't previously.
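
To make the request for field sizes and endianness concrete, here is one
possible layout of the control header, sketched in Python. The command
numbers follow the quoted list. The field widths (three unsigned 32-bit
integers), the network byte order, and the field order Length, Type,
Version are assumptions made for illustration; the patch does not
specify them yet. The helper names are hypothetical as well.

```python
import struct
from enum import IntEnum


class Command(IntEnum):
    # Values follow the numbering in the quoted documentation.
    NONE = 1
    READY = 2
    QEMU_FILE = 3
    RAM_BLOCKS = 4
    REGISTER_REQUEST = 5
    REGISTER_RESULT = 6


# Assumed layout: Length, Type, Version as unsigned 32-bit big-endian.
HEADER = struct.Struct("!III")


def pack_message(command, version, payload=b""):
    """Build one SEND message: header followed by the data portion."""
    return HEADER.pack(len(payload), int(command), version) + payload


def unpack_message(data, expected_version):
    """Split a SEND message into (command, payload), checking the version."""
    length, command, version = HEADER.unpack_from(data)
    if version != expected_version:
        raise ValueError("protocol version mismatch")
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return Command(command), payload
```

With this layout a register request carrying a 4-byte payload is 16
bytes on the wire, and both sides agree on the encoding regardless of
host endianness.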

> >>
> >>There is only one capability right now: dynamic server registration.
> >>
> >>The client must tell the server whether or not the capability was
> >>enabled or not on the primary VM side.
> >>
> >>Will update the documentation.
> >Cool, best add an exact structure format.
> Acknowledged.
> >>>>+
> >>>>+Main memory is not migrated with the aforementioned protocol,
> >>>>+but is instead migrated with normal RDMA Write operations.
> >>>>+
> >>>>+Pages are migrated in "chunks" (about 1 Megabyte right now).
> >>>Why "about"? This is not dynamic so needs to be exactly same
> >>>on both sides, right?
> >>About is a typo =). It is hard-coded to exactly 1MB.
> >This, by the way, is something management *may* want to control.
> Acknowledged.
> >>>>+Chunk size is not dynamic, but it could be in a future implementation.
> >>>>+There's nothing to indicate that this is useful right now.
> >>>>+
> >>>>+When a chunk is full (or a flush() occurs), the memory backed by
> >>>>+the chunk is registered with librdmacm and pinned in memory on
> >>>>+both sides using the aforementioned protocol.
> >>>>+
> >>>>+After pinning, an RDMA Write is generated and transmitted
> >>>>+for the entire chunk.
> >>>>+
> >>>>+Chunks are also transmitted in batches: This means that we
> >>>>+do not request that the hardware signal the completion queue
> >>>>+for the completion of *every* chunk. The current batch size
> >>>>+is about 64 chunks (corresponding to 64 MB of memory).
> >>>>+Only the last chunk in a batch must be signaled.
> >>>>+This helps keep everything as asynchronous as possible
> >>>>+and helps keep the hardware busy performing RDMA operations.
> >>>>+
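
The batching rule quoted above can be sketched as follows (Python, for
illustration only). The 1 MB chunk size and the batch of 64 chunks come
from the thread. The function names are hypothetical, and signaling the
final chunk of a partial batch is an assumption about how a flush would
have to behave. With libibverbs, "signaled" would correspond to setting
IBV_SEND_SIGNALED on that work request.

```python
CHUNK_SIZE = 1 << 20  # 1 MB, hard-coded according to the thread
BATCH_CHUNKS = 64     # chunks per batch, i.e. 64 MB of memory


def chunk_index(offset):
    """Index of the chunk containing a byte offset within a RAM block."""
    return offset // CHUNK_SIZE


def plan_writes(num_chunks, batch=BATCH_CHUNKS):
    """Return (chunk, signaled) pairs for num_chunks RDMA writes.

    Only the last chunk of each batch requests a completion. The very
    last chunk is signaled too, so a partial batch is still flushed
    (an assumption, see above).
    """
    return [
        (i, (i + 1) % batch == 0 or i == num_chunks - 1)
        for i in range(num_chunks)
    ]
```

For 130 chunks this signals only chunks 63, 127 and 129, so the
completion queue sees three entries instead of 130.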
> >>>>+Error-handling:
> >>>>+===============================
> >>>>+
> >>>>+Infiniband has what is called a "Reliable, Connected"
> >>>>+link (one of 4 choices). This is the mode
> >>>>+we use for RDMA migration.
> >>>>+
> >>>>+If a *single* message fails,
> >>>>+the decision is to abort the migration entirely and
> >>>>+clean up all the RDMA descriptors and unregister all
> >>>>+the memory.
> >>>>+
> >>>>+After cleanup, the Virtual Machine is returned to normal
> >>>>+operation the same way that would happen if the TCP
> >>>>+socket is broken during a non-RDMA based migration.
> >>>That's on sender side? Presumably this means you respond to
> >>>completion with error? How does receive side know
> >>>migration is complete?
> >>Yes, on the sender side.
> >>
> >>Migration "completeness" logic has not changed in this patch series.
> >>
> >>Please recall that the entire QEMUFile protocol is still
> >>happening at the upper-level inside of savevm.c/arch_init.c.
> >>
> >So basically receive side detects that migration is complete by
> >looking at the QEMUFile data?
> >
> That's correct - same mechanism used by TCP.
