Re: [Qemu-devel] [RFC] COLO HA Project proposal
From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [RFC] COLO HA Project proposal
Date: Tue, 1 Jul 2014 13:12:48 +0100
User-agent: Mutt/1.5.23 (2014-03-12)
* Hongyang Yang (address@hidden) wrote:
Hi Yang,
> Background:
> The COLO HA project is a high availability solution. Both the primary
> VM (PVM) and the secondary VM (SVM) run in parallel. They receive the
> same requests from clients and generate responses in parallel too.
> If the response packets from the PVM and SVM are identical, they are
> released immediately. Otherwise, a VM checkpoint is taken on demand.
> The idea was presented at Xen Summit 2012 and 2013, and in an
> academic paper at SoCC 2013. It was also presented at KVM Forum
> 2013:
> http://www.linux-kvm.org/wiki/images/1/1d/Kvm-forum-2013-COLO.pdf
> Please refer to the above document for detailed information.
Yes, I remember that talk - very interesting.
I didn't quite understand a couple of things, though; perhaps you
can explain:
1) If we ignore the TCP sequence number problem, in an SMP machine
don't we get other sources of randomness - e.g. which core completes
something first, or who wins a contended lock - so the output stream
might not be identical? Do those normal bits of randomness cause the
machines to flag as out-of-sync?
2) If the PVM has decided that the SVM is out of sync (due to 1) and
the PVM fails at about the same point, can we switch over to the SVM?
I'm worried that due to (1) there are periods where the system
is out of sync and a failure of the PVM is not protected against.
Does that happen? If so, how often?
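To make question 1 concrete, here is a toy model (entirely hypothetical, not COLO code; `run_vm` and the ticket scheme are invented for illustration). Both VMs execute correctly, yet because a different vCPU wins the shared lock on each side, every reply differs byte for byte, which a naive output comparison would treat as divergence.

```python
from typing import Dict, List


def run_vm(lock_order: List[str]) -> Dict[str, bytes]:
    """Hypothetical guest handling one request per client.

    Each request takes a shared lock and is handed the next ticket number.
    `lock_order` stands in for which vCPU happens to win the lock first,
    something neither VM controls.
    """
    next_ticket = 1
    replies: Dict[str, bytes] = {}
    for client in lock_order:
        replies[client] = f"ticket={next_ticket}".encode()
        next_ticket += 1
    return replies


pvm = run_vm(["A", "B"])  # A's vCPU wins the lock on the primary
svm = run_vm(["B", "A"])  # B's vCPU wins it on the secondary
mismatched = sorted(c for c in pvm if pvm[c] != svm[c])
print(mismatched)  # ['A', 'B']
```

Neither run is wrong; the replies differ only because of scheduling, which is exactly the case question 1 asks about.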
> The attached picture shows the architecture of kvm-COLO we proposed.
> - COLO Manager: requires modifications to QEMU
> - COLO Controller
> The COLO Controller includes modifications to the save/restore
> flow, similar to MC (microcheckpointing); a memory cache on the
> secondary VM which caches the dirty pages of the primary VM;
> and a failover module which provides APIs to communicate
> with an external heartbeat module.
> - COLO Disk Manager
> When the PVM writes data to its image, the COLO Disk Manager
> captures the data and sends it to the COLO Disk Manager on the
> secondary side, which keeps the contents of the SVM's image
> consistent with those of the PVM's image.
I wonder if there is any way to coordinate this between COLO, Michael
Hines' microcheckpointing and the two separate reverse-execution
projects, which also need to do similar things.
Are there any standard APIs for the heartbeat that we can already
tie into?
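The kind of hook a failover module would expose to a heartbeat source can be sketched as a simple timeout-based failure detector. `HeartbeatMonitor`, its parameters and the `on_failover` callback are hypothetical, not an existing QEMU or COLO API; the clock is injected so the logic can be exercised without real delays.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Hypothetical timeout-based failure detector feeding a failover hook."""

    def __init__(self, timeout: float, on_failover: Callable[[], None],
                 clock: Callable[[], float]) -> None:
        self.timeout = timeout          # seconds of silence tolerated
        self.on_failover = on_failover  # e.g. "promote the SVM"
        self.clock = clock              # injectable for testing
        self.last_seen = clock()
        self.failed = False

    def heartbeat(self) -> None:
        # Called whenever a heartbeat message from the peer arrives.
        self.last_seen = self.clock()

    def poll(self) -> bool:
        # Declare the peer dead after `timeout` seconds without a heartbeat;
        # the failover hook runs at most once.
        if not self.failed and self.clock() - self.last_seen > self.timeout:
            self.failed = True
            self.on_failover()
        return self.failed


# Real use would pass a monotonic clock rather than wall-clock time.
monitor = HeartbeatMonitor(1.0, lambda: print("failover"), time.monotonic)
```

A timeout alone cannot tell a dead primary from a partitioned network, which is one reason a shared, external heartbeat service is attractive here.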
> - COLO Agent ("Proxy module" in the arch picture)
> We need an agent to compare the packets returned by the
> Primary VM and the Secondary VM, and to decide whether to start a
> checkpoint according to some rules. It is a Linux kernel
> module on the host.
Why is that a kernel module, and how does it communicate the state
to the QEMU instance?
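Wherever the agent ends up running, the comparison it performs can be modelled in user space. This sketch is hypothetical (the names, the per-connection keying and the release policy are assumptions, not the proposal's rules): outgoing packets are queued per connection, so a different interleaving between connections is not counted as divergence, while any payload difference on a single connection requests a checkpoint. It does not model TCP sequence-number handling or the state reset that follows a checkpoint.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

ConnKey = Tuple[str, int, str, int]  # (src ip, src port, dst ip, dst port)


class PacketComparator:
    """Hypothetical model of the agent's compare-then-release step."""

    def __init__(self) -> None:
        self.pvm: Dict[ConnKey, Deque[bytes]] = defaultdict(deque)
        self.svm: Dict[ConnKey, Deque[bytes]] = defaultdict(deque)
        self.released: List[Tuple[ConnKey, bytes]] = []
        self.checkpoint_requests = 0

    def from_pvm(self, conn: ConnKey, payload: bytes) -> None:
        self.pvm[conn].append(payload)
        self._compare(conn)

    def from_svm(self, conn: ConnKey, payload: bytes) -> None:
        self.svm[conn].append(payload)
        self._compare(conn)

    def _compare(self, conn: ConnKey) -> None:
        # Pair up packets on this connection once both VMs have produced them.
        p, s = self.pvm[conn], self.svm[conn]
        while p and s:
            pkt, other = p.popleft(), s.popleft()
            if pkt != other:
                # Divergence: this is where a checkpoint would be triggered;
                # the sketch only counts the request.
                self.checkpoint_requests += 1
            # The primary's packet is the one released to the client.
            self.released.append((conn, pkt))
```

Keying by connection addresses the SMP concern in question 1 only for reorderings across connections; it does nothing for differences within a single stream.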
> - Other minor modifications
> We may need other modifications for better performance.
Dave
P.S. I'm starting to look at fault-tolerance stuff, but haven't
got very far yet, so I'm trying to understand the details
of COLO, microcheckpointing, etc.
> --
> Thanks,
> Yang.
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK