qemu-devel


From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
Date: Wed, 19 Feb 2014 11:27:16 +0000
User-agent: Mutt/1.5.21 (2010-09-15)

* Michael R. Hines (address@hidden) wrote:
> On 02/18/2014 08:45 PM, Dr. David Alan Gilbert wrote:
> >>+The Micro-Checkpointing Process
> >>+Basic Algorithm
> >>+Micro-Checkpoints (MC) work against the existing live migration path in 
> >>QEMU, and can effectively be understood as a "live migration that never 
> >>ends". As such, iteration rounds happen at the granularity of 10s of 
> >>milliseconds and perform the following steps:
> >>+
> >>+1. After N milliseconds, stop the VM.
> >>+2. Generate an MC by invoking the live migration software path to identify
> >>and copy dirty memory into a local staging area inside QEMU.
> >>+3. Resume the VM immediately so that it can make forward progress.
> >>+4. Transmit the checkpoint to the destination.
> >>+5. Repeat.
> >>+Upon failure, load the contents of the last MC at the destination back 
> >>into memory and run the VM normally.
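The loop in the quoted steps can be sketched roughly as follows (a hypothetical illustration only; FakeVM, FakeDest, and the helper names are invented, not the patch's actual code):

```python
import time

class FakeVM:
    """Stand-in for a guest VM; records the order of operations."""
    def __init__(self):
        self.log = []
        self.dirty = [b"page0"]
    def pause(self): self.log.append("pause")
    def resume(self): self.log.append("resume")
    def copy_dirty_pages(self):
        self.log.append("stage")
        return list(self.dirty)

class FakeDest:
    """Stand-in for the destination host; keeps the last checkpoint."""
    def __init__(self):
        self.last_mc = None
    def send(self, staged):
        self.last_mc = staged

def mc_loop(vm, dest, epoch_ms=10, rounds=2):
    """Hypothetical micro-checkpointing loop: run, pause, stage dirty
    pages locally, resume, then transmit the staged checkpoint."""
    for _ in range(rounds):
        time.sleep(epoch_ms / 1000.0)   # 1. let the VM run for N ms
        vm.pause()                      #    then stop it
        staged = vm.copy_dirty_pages()  # 2. stage dirty memory locally
        vm.resume()                     # 3. resume immediately
        dest.send(staged)               # 4. transmit the checkpoint
                                        # 5. repeat

vm, dest = FakeVM(), FakeDest()
mc_loop(vm, dest)
```

The key property is that the VM is paused only for the local staging copy; transmission overlaps with resumed execution.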
> >Later you talk about the memory allocation and how you grow the memory as
> >needed to fit the checkpoint; have you tried going the other way and
> >triggering checkpoints sooner if they're taking too much memory?
> 
> There is a 'knob' in this patch called "mc-set-delay" which was designed
> to solve exactly that problem. It allows policy or management software
> to make an independent decision about what the frequency of the
> checkpoints should be.
> 
> I wasn't comfortable implementing policy directly inside the patch as
> that seemed less likely to get accepted by the community sooner.

I was just wondering if a separate 'max buffer size' knob would let
you bound memory more reasonably without setting policy; I don't think
people like having potentially 2x memory.
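For what it's worth, combining such a knob with "mc-set-delay" could be as simple as a two-condition trigger (a sketch; `should_checkpoint` and the 64 MiB default are invented names/values, not part of the patch):

```python
def should_checkpoint(elapsed_ms, buffer_bytes,
                      delay_ms=100, max_buffer_bytes=64 * 1024 * 1024):
    """Checkpoint when the configured delay expires OR the staging
    buffer hits a hard memory cap, whichever comes first."""
    return elapsed_ms >= delay_ms or buffer_bytes >= max_buffer_bytes
```

With a 64 MiB cap, a sudden burst of dirty memory forces an early checkpoint even if only 40 ms of the epoch have elapsed, bounding worst-case memory without baking frequency policy into QEMU.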

> >>+1. MC over TCP/IP: Once the socket connection breaks, we assume
> >>failure. This is detected very early in the loss of the latest MC, not
> >>only because a very large number of bytes is typically being sequenced
> >>in a TCP stream but perhaps also because of the timeout on
> >>acknowledgement of the receipt of a commit message by the destination.
> >>+
> >>+2. MC over RDMA: Since Infiniband does not provide any underlying
> >>timeout mechanisms, this implementation enhances QEMU's RDMA migration
> >>protocol to include a simple keep-alive. Upon the loss of multiple
> >>keep-alive messages, the sender is deemed to have failed.
> >>+
> >>+In both cases, whether due to a failed TCP socket connection or a lost
> >>RDMA keep-alive, either the sender or the receiver can be deemed to
> >>have failed.
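The RDMA keep-alive rule described above might reduce to something like this (a sketch with invented names; the real patch's protocol details differ):

```python
def peer_failed(last_keepalive_s, now_s, interval_s=1.0, max_missed=3):
    """Sketch of the RDMA-side detector: the peer is deemed dead once
    `max_missed` consecutive keep-alive intervals elapse in silence."""
    return (now_s - last_keepalive_s) > max_missed * interval_s
```

So with a 1-second keep-alive interval and three allowed misses, silence of just over three seconds declares the peer failed, far sooner than a default TCP timeout would.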
> >>+
> >>+If the sender is deemed to have failed, the destination takes over 
> >>immediately using the contents of the last checkpoint.
> >>+
> >>+If the destination is deemed to be lost, we perform the same action
> >>as a live migration: resume the sender normally and wait for management
> >>software to make a policy decision about whether or not to re-protect
> >>the VM, which may involve a third-party to identify a new destination
> >>host again to use as a backup for the VM.
> >In this world what is making the decision about whether the 
> >sender/destination
> >should win - how do you avoid a split brain situation where both
> >VMs are running but the only thing that failed is the comms between them?
> >Is there any guarantee that you'll have received knowledge of the comms
> >failure before you pull the plug out and enable the corked packets to be
> >sent on the sender side?
> 
> Good question in general - I'll add it to the FAQ. The patch implements
> a basic 'transaction' mechanism in coordination with an outbound I/O
> buffer (documented further down). With these two things in
> place, split-brain is not possible because the destination is not running.
> We don't allow the destination to resume execution until a committed
> transaction has been acknowledged by the destination, and only then
> do we allow any outbound network traffic to be released to the
> outside world.
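The transaction/buffer interaction Michael describes can be sketched like so (invented names; the actual patch hooks into QEMU's migration and packet-buffering paths):

```python
class OutboundBuffer:
    """Corks guest network output until the matching checkpoint has
    been acknowledged by the destination."""
    def __init__(self):
        self.corked = []   # packets generated during this epoch
        self.wire = []     # packets released to the outside world
    def send(self, pkt):
        self.corked.append(pkt)        # hold until the epoch commits
    def release(self):
        self.wire.extend(self.corked)  # safe: checkpoint is committed
        self.corked.clear()

def commit_epoch(buffer, transmit, await_ack):
    """Only after the destination acknowledges the committed checkpoint
    do the corked packets reach the outside world."""
    transmit()                  # send the checkpoint to the destination
    if await_ack():             # destination acknowledged the commit
        buffer.release()
        return True
    return False                # no ack: keep the packets corked

buf = OutboundBuffer()
buf.send(b"reply-to-client")
ok = commit_epoch(buf, transmit=lambda: None, await_ack=lambda: True)
```

If the ack never arrives, the packets stay corked, which is exactly where Dave's split-brain question bites: the source must eventually decide whether to release them and carry on.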

Yeh, I see the IO buffer; what I've not figured out is how:
  1) MC over TCP/IP gets an acknowledge on the source to know when
     it can unplug its buffer.
  2) Let's say the MC connection fails, so that ack never arrives:
     the source must assume the destination has failed, release its
     packets, and carry on.
     The destination must assume the source has failed and take over.

     Now they're both running - and that's bad, and it's standard
     split brain.
  3) If we're relying on the TCP/IP timeout, that's quite long.

> >>+RDMA is used for two different reasons:
> >>+
> >>+1. Checkpoint generation (RDMA-based memcpy), and
> >>+2. Checkpoint transmission.
> >>+Checkpoint generation must be done while the VM is paused. In the
> >>worst case, the size of the checkpoint can be equal to the total amount
> >>of memory in use by the VM. In order to resume VM execution as
> >>fast as possible, the checkpoint is first copied consistently into
> >>a local staging area before transmission. A standard memcpy() of such a
> >>potentially large amount of memory not only gets no use out of the CPU
> >>cache but also potentially clogs up the CPU pipeline, which would
> >>otherwise be available to neighbor VMs on the same physical node that
> >>could be scheduled for execution. To minimize the effect on neighbor
> >>VMs, we use RDMA to perform a "local" memcpy(), bypassing the host
> >>processor. On more recent processors, a 'beefy' enough memory bus
> >>architecture can move memory just as fast as (sometimes faster than) a
> >>pure-software, CPU-only optimized memcpy() from libc. However, on older
> >>computers, this feature only gives you the benefit of lower CPU
> >>utilization at the expense of
> >Isn't there a generic kernel DMA ABI for doing this type of thing (I
> >think there was at one point, people have suggested things like using
> >graphics cards to do it but I don't know if it ever happened).
> >The other question is, do you always need to copy - what about something
> >like COWing the pages?
> 
> Excellent question! Responding in two parts:
> 
> 1) The kernel ABI 'vmsplice' is what I think you're referring to. Correct
>      me if I'm wrong, but vmsplice was actually designed to avoid copies
>      entirely between two userspace programs, to move memory
>      more efficiently - whereas a fault-tolerant system actually *needs*
>      a copy to be made.

No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
to use Intel's I/OAT, graphics cards, etc. for doing things like page
zeroing and DMAing data around. I can see there is a dmaengine API in the
kernel; I haven't found where, if anywhere, it is exposed to userspace.

> 2) Using COW: Actually, I think that's an excellent idea. I've bounced that
>      around with my colleagues, but we simply didn't have the manpower
>      to implement and benchmark it. There was also some concern about
>      performance: would the writable working set of the guest be so
>      active/busy that COW would not get you much benefit? I think it's
>      worth a try. Patches welcome =)
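For reference, one classic way to get a COW snapshot on Linux (used by some other checkpointing systems, not by this patch) is to fork(): the child holds a copy-on-write image of memory that it can serialize at leisure, while the parent resumes immediately. A minimal sketch:

```python
import os

def cow_checkpoint(serialize):
    """fork() the process: the child sees a copy-on-write snapshot of
    memory taken at this instant and can serialize it at leisure; the
    parent resumes immediately, and the kernel copies only the pages
    the parent subsequently dirties."""
    pid = os.fork()
    if pid == 0:              # child: operates on the frozen snapshot
        serialize()
        os._exit(0)
    return pid                # parent: resume execution at once

state = {"counter": 1}
r, w = os.pipe()
pid = cow_checkpoint(lambda: os.write(w, str(state["counter"]).encode()))
state["counter"] = 2          # parent mutates memory after the snapshot
os.waitpid(pid, 0)
os.close(w)
snap = os.read(r, 16)         # the child saw the pre-mutation value
```

This avoids the full stop-and-copy, at exactly the cost Michael notes: a busy writable working set forces the kernel to copy many pages anyway.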

It's possible that might be doable with some of the same tricks I'm
looking at for post-copy, I'll see what I can do.

Dave
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


