From: Yoshiaki Tamura
Subject: Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Date: Mon, 26 Apr 2010 19:44:11 +0900
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ja; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4
Anthony Liguori wrote:
> On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:
>> Anthony Liguori wrote:
>>> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>>>> 2010/4/22 Dor Laor <address@hidden>:
>>>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>>> Dor Laor wrote:
>>>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We have been implementing the prototype of Kemari for KVM, and
>>>>>>>> we're sending this message to share what we have now and our
>>>>>>>> TODO lists. Hopefully, we would like to get early feedback to
>>>>>>>> keep us in the right direction. Although the advanced
>>>>>>>> approaches in the TODO lists are fascinating, we would like to
>>>>>>>> run this project step by step while absorbing comments from
>>>>>>>> the community.
>>>>>>>>
>>>>>>>> The current code is based on qemu-kvm.git
>>>>>>>> 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>>>
>>>>>>>> For those who are new to Kemari for KVM, please take a look at
>>>>>>>> the following RFC which we posted last year:
>>>>>>>> http://www.mail-archive.com/address@hidden/msg25022.html
>>>>>>>>
>>>>>>>> The transmission/transaction protocol and most of the control
>>>>>>>> logic are implemented in QEMU. However, we needed a hack in
>>>>>>>> KVM to prevent rip from proceeding before synchronizing VMs.
>>>>>>>> It may also need some plumbing on the kernel side to guarantee
>>>>>>>> replayability of certain events and instructions, to integrate
>>>>>>>> the RAS capabilities of newer x86 hardware with the HA stack,
>>>>>>>> and for optimization purposes, for example.
>>>>>>>>
>>>>>>>> [snip]
>>>>>>>>
>>>>>>>> The rest of this message describes the TODO lists grouped by
>>>>>>>> topic.
>>>>>>>>
>>>>>>>> === event tapping ===
>>>>>>>>
>>>>>>>> Event tapping is the core component of Kemari, and it decides
>>>>>>>> on which event the primary should synchronize with the
>>>>>>>> secondary. The basic assumption here is that outgoing I/O
>>>>>>>> operations are idempotent, which is usually true for disk I/O
>>>>>>>> and reliable network protocols such as TCP.
>>>>>>>
>>>>>>> IMO any type of network event should be stalled too. What if
>>>>>>> the VM runs a non-TCP protocol and the packet that the master
>>>>>>> node sent reached some remote client, and the master failed
>>>>>>> before the sync to the slave?
>>>>>>
>>>>>> In the current implementation, it is actually stalling any type
>>>>>> of network that goes through virtio-net. However, if the
>>>>>> application is using unreliable protocols, it should have its
>>>>>> own recovery mechanism, or it should be completely stateless.
>>>>>
>>>>> Why do you treat TCP differently? You can damage the entire VM
>>>>> this way - think of a dhcp request that was dropped at the moment
>>>>> you switched between the master and the slave.
>>>>
>>>> I'm not trying to say that we should treat TCP differently, just
>>>> that it's severe. In the case of a dhcp request, the client would
>>>> have a chance to retry after failover, correct? BTW, in the
>>>> current implementation,
>>>
>>> I'm slightly confused about the current implementation vs. my
>>> recollection of the original paper with Xen. I had thought that all
>>> disk and network I/O was buffered in such a way that at each
>>> checkpoint, the I/O operations would be released in a burst.
>>> Otherwise, you would have to synchronize after every I/O operation,
>>> which is what it seems the current implementation does.
>>
>> Yes, you're almost right. It's synchronizing before QEMU starts
>> emulating I/O at each device model.
>
> If NodeA is the master and NodeB is the slave, and NodeA sends a
> network packet, you'll checkpoint before the packet is actually sent.
> Then, if a failure occurs before the next checkpoint, won't that
> result in both NodeA and NodeB sending out a duplicate version of the
> packet?
Yes. But I think it's better than taking the checkpoint after. If we checkpoint after sending the packet, let's say the VM sent a TCP ACK to the client, and a hardware failure occurred on NodeA during the transaction *but the client received the TCP ACK*. NodeB will resume from the previous state, and it may need to receive some data from the client. However, because the client has already received the TCP ACK, it won't resend the data to NodeB. It looks like this data is going to be dropped.
Anyway, I've just started planning to move the sync point to the network/block layer, and I'll post the result for discussion again.