Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
Date: Wed, 11 Mar 2015 10:07:11 +0000
User-agent: Mutt/1.5.23 (2014-03-12)
* zhanghailiang (address@hidden) wrote:
> On 2015/3/11 17:06, Dr. David Alan Gilbert wrote:
> >* zhanghailiang (address@hidden) wrote:
> >>Hi Dave,
> >>
> >>Sorry for the late reply :)
> >
> >No problem.
> >
> >>On 2015/3/7 2:30, Dr. David Alan Gilbert wrote:
> >>>* zhanghailiang (address@hidden) wrote:
> >>>>On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
> >>>>>From: "Dr. David Alan Gilbert" <address@hidden>
> >>>>
> >>>>Hi Dave,
> >>>>
> >>>>>
> >>>>>Hi,
> >>>>> I'm getting COLO running on a couple of our machines here
> >>>>>and wanted to see what was actually going on, so I merged
> >>>>>in my recent rolling-stats code:
> >>>>>
> >>>>>http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
> >>>>>
> >>>>>with the following patch, and now I get on the primary side,
> >>>>>info migrate shows me:
> >>>>>
> >>>>>capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> >>>>>zero-blocks: off colo: on
> >>>>>Migration status: colo
> >>>>>total time: 0 milliseconds
> >>>>>colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted:
> >>>>>4.3136025e-158) Count: 4020 Values: address@hidden, address@hidden,
> >>>>>address@hidden, address@hidden, address@hidden, address@hidden,
> >>>>>address@hidden, address@hidden, address@hidden, address@hidden
> >>>>>colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted:
> >>>>>76.243584) Count: 4019 Values: address@hidden, address@hidden,
> >>>>>address@hidden, address@hidden, address@hidden, address@hidden,
> >>>>>address@hidden, address@hidden, address@hidden, address@hidden
> >>>>>colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4
> >>>>>(Weighted: 127195.56) Count: 4020 Values: address@hidden,
> >>>>>address@hidden, address@hidden, address@hidden, address@hidden,
> >>>>>address@hidden, address@hidden, address@hidden, address@hidden,
> >>>>>address@hidden
> >>>>>
> >>>>>which suggests I've got a problem with the packet comparison; but that's
> >>>>>a separate issue I'll look at.
> >>>>>
> >>>>
> >>>>There is an obvious mistake we made in the proxy: the macro
> >>>>'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
> >>>>so please fix it before doing the following test. Sorry for this low-grade
> >>>>mistake; we should have done a full test before issuing it. ;)
> >>>
> >>>No, that's OK; we all make them.
> >>>
> >>>However, that didn't cure my problem; but after a bit of experimentation
> >>>I now have COLO working pretty well; thanks for the help!
> >>>
> >>> 1) I had to disable IPv6 in the guest; it doesn't look like the
> >>> conntrack is coping with IPv6 ICMPV6, and on our test network
> >>> we're getting a few 10s of those each second, so it's constant
> >>> miscompares (they seem to be neighbour broadcasts and multicast
> >>> stuff).
> >>>
> >>
> >>Hmm, yes, the proxy code on github does not support comparing ICMPv6 packets.
> >>We will add this in the future.
> >>
> >>> 2) It looks like virtio-net is sending ARPs - possibly every time
> >>> that a snapshot is loaded; it's not the 'qemu' announce-self code,
> >>> (I added some debug there and it's not being called); and ARPs
> >>> cause a miscompare - so you get a continuous stream of miscompares
> >>> because a miscompare triggers a new snapshot, that sends more ARPs.
> >>> I solved this by switching to e1000.
> >>>
> >>
> >>I didn't hit this problem; I used tcpdump to capture the network packets and
> >>did not find any ARPs after the VM loaded on the slave.
> >
> >Interesting.
> >
> >>Maybe I missed something. Are there any network-related servers/commands
> >>running in the VM?
> >
> >I don't think so, and even if they were, I don't think they would go away
> >by switching to an e1000; I see there is a 'VIRTIO_NET_S_ANNOUNCE' feature
> >in virtio-net, and I suspect it's that which is doing it, but maybe it
> >depends on the guest/host kernels to have it enabled?
> >
>
> Er, quite possible. My host kernel is 3.14.0, and the guest is suse11sp3...
I'm running 3.18 on both host and guest (Fedora 20 guest, RHEL7 host but
with custom kernel).
Dave
>
> >>And what's your tcpdump command line?
> >
> >just tcpdump -i em4 -n -w outputfile
> >
> >>> 3) The other problem with virtio is it's occasionally triggering a
> >>> 'virtio: error trying to map MMIO memory' from qemu; I'm not sure
> >>> why, the state COLO sends over should always be consistent.
> >>>
> >>> 4) With the e1000 setup; connections are generally fairly responsive,
> >>> but sshing into the guest takes *ages* (10s of seconds). I'm not sure
> >>> why, because a curl to a web server seems OK (less than a second)
> >>> and once the ssh is open it's pretty responsive.
> >>>
> >>
> >>Er, have you tried to ssh into the guest without COLO mode? Does it also
> >>take a long time?
> >
> >Not yet; I'm going to try and take some logging to it to find out why.
> >
> >>I have encountered a similar situation where the slave VM appears dead:
> >>'info status' reports 'running', but the VM cannot respond to the keyboard
> >>from VNC. Maybe there is something wrong with the device status; I
> >>will look into it.
> >>
> >>> 5) I've seen one instance of;
> >>> 'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw:
> >>> Assertion `p - buf == aiocb->aio_nbytes' failed.'
> >>> on the primary side.
> >>>
> >>>Stats for a mostly idle guest are now showing:
> >>>
> >>>colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214)
> >>>Count: 227 Values: address@hidden, address@hidden, address@hidden,
> >>>address@hidden, address@hidden, address@hidden, address@hidden,
> >>>address@hidden, address@hidden, address@hidden
> >>>colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752)
> >>>Count: 227 Values: address@hidden, address@hidden, address@hidden,
> >>>address@hidden, address@hidden, address@hidden, address@hidden,
> >>>address@hidden, address@hidden, address@hidden
> >>>colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6
> >>>(Weighted: 4826386.5) Count: 227 Values: address@hidden, address@hidden,
> >>>address@hidden, address@hidden, address@hidden, address@hidden,
> >>>address@hidden, address@hidden, address@hidden, address@hidden
> >>>
> >>>So, one checkpoint every ~1.5 seconds; that's just with an
> >>>ssh connected and a script doing a 'curl' to its http
> >>>repeatedly. Running 'top' on the ssh with a fast refresh
> >>>brings the checkpoints much faster; I guess that's because
> >>>the output of top is quite random.
> >>>
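For readers trying to interpret the "Min/Max", "Mean", "(Weighted: ...)" and "Count" fields in the output above, here is a minimal sketch of an accumulator that produces figures of that kind. It is not the code from the rolling-stats patch; the type and function names are hypothetical, and using an exponentially weighted moving average for the "Weighted" figure is an assumption, since the weighting scheme is not described in this thread.

```c
#include <assert.h>

/* Hypothetical rolling-statistics accumulator (illustrative names only,
 * not taken from the QEMU patch discussed in this thread). */
typedef struct {
    double min;
    double max;
    double sum;          /* running sum, used for the plain mean */
    double weighted;     /* exponentially weighted moving average (assumed) */
    double alpha;        /* weight given to the newest sample, 0 < alpha <= 1 */
    unsigned long count; /* number of samples seen so far */
} RollingStats;

void rolling_stats_init(RollingStats *s, double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    s->min = 0.0;
    s->max = 0.0;
    s->sum = 0.0;
    s->weighted = 0.0;
    s->alpha = alpha;
    s->count = 0;
}

void rolling_stats_add(RollingStats *s, double value)
{
    if (s->count == 0) {
        /* The first sample seeds min, max and the weighted average. */
        s->min = value;
        s->max = value;
        s->weighted = value;
    } else {
        if (value < s->min) {
            s->min = value;
        }
        if (value > s->max) {
            s->max = value;
        }
        /* Newer samples count for more than older ones. */
        s->weighted = s->alpha * value + (1.0 - s->alpha) * s->weighted;
    }
    s->sum += value;
    s->count++;
}

double rolling_stats_mean(const RollingStats *s)
{
    /* Report 0 for an empty set rather than dividing by zero. */
    return s->count ? s->sum / (double)s->count : 0.0;
}
```

Each checkpoint would feed one sample (interval, paused time, or size) into its own accumulator. A weighted figure above the plain mean, as in the paused-time line above, would then indicate that recent samples are larger than the long-run average.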
> >>
> >>Yes, it is a known problem. Actually, it is not only the 'top' command:
> >>every command with random output may result in continuous miscompares.
> >>Besides, the data transferred through SSH is encrypted, which makes
> >>things worse.
> >>
> >>One way to solve this problem may be:
> >>if we detect a continuous stream of miscompares, we fall back to
> >>Microcheckpointing mode (periodic checkpoint).
> >
> >Yes, I was going to try and implement that fallback - I've got some ideas
> >to try for it.
> >
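The fallback proposed above (switch to periodic checkpointing once miscompares become continuous) could be sketched roughly as follows. This is only an illustration under stated assumptions: the names are hypothetical and not from the COLO patches, "continuous" is taken to mean a configurable number of consecutive miscompare-triggered checkpoints, and returning to compare mode after one clean interval is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical checkpoint policy; none of these names come from COLO. */
typedef enum {
    CKPT_MODE_COMPARE,   /* checkpoint when the packet comparison fails */
    CKPT_MODE_PERIODIC   /* checkpoint on a fixed timer (microcheckpointing) */
} CkptMode;

typedef struct {
    CkptMode mode;
    unsigned streak;     /* consecutive checkpoints caused by a miscompare */
    unsigned threshold;  /* streak length that triggers the fallback */
} CkptPolicy;

void ckpt_policy_init(CkptPolicy *p, unsigned threshold)
{
    assert(threshold > 0);
    p->mode = CKPT_MODE_COMPARE;
    p->streak = 0;
    p->threshold = threshold;
}

/* Call once per checkpoint. 'miscompare' is true when the checkpoint was
 * forced by a packet miscompare. Returns the mode for the next interval. */
CkptMode ckpt_policy_update(CkptPolicy *p, bool miscompare)
{
    if (miscompare) {
        /* Saturate at the threshold so the counter cannot overflow. */
        if (p->streak < p->threshold) {
            p->streak++;
        }
    } else {
        /* Assumption: one clean interval is enough to leave the fallback. */
        p->streak = 0;
    }
    p->mode = (p->streak >= p->threshold) ? CKPT_MODE_PERIODIC
                                          : CKPT_MODE_COMPARE;
    return p->mode;
}
```

With a threshold of 3, for example, three miscompare-triggered checkpoints in a row would switch to periodic mode, and the first interval without a miscompare would switch back. A real implementation would probably want hysteresis or a minimum time in periodic mode to avoid flapping between the two.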
> >>>>To be honest, the proxy part on github is not complete; we have cut it
> >>>>down to make it easier to review and understand, so there may be some mistakes.
> >>>
> >>>Yes, that's OK; and I've had a few kernel crashes; normally
> >>>when the qemu crashes, the kernel doesn't really like it;
> >>>but that's OK, I'm sure it will get better.
> >>>
> >>
> >>Hmm, thanks very much for your feedback; we are working to
> >>improve it... ;)
> >
> >Thanks,
> >
> >Dave
> >
> >>
> >>>I added the following to make my debug easier; which is how
> >>>I found the IPv6 problem.
> >>>
> >>>diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
> >>>index 9e50b62..13c0b48 100644
> >>>--- a/xt_PMYCOLO.c
> >>>+++ b/xt_PMYCOLO.c
> >>>@@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
> >>> h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
> >>>
> >>> if (h == NULL) {
> >>>- pr_dbg("can't find master's ct for slaver packet\n");
> >>>+ pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
> >>> return NULL;
> >>> }
> >>>
> >>>@@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
> >>> /* rcu_read_lock()ed by nf_hook_slow */
> >>> l3proto = __nf_ct_l3proto_find(pf);
> >>> if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff,
> >>> &protonum) <= 0) {
> >>>- pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
> >>>+ pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
> >>> NF_CT_STAT_INC_ATOMIC(&init_net, error);
> >>> NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
> >>> goto out;
> >>>
> >>>>
> >>>>Thanks,
> >>>>zhanghailiang
> >>>
> >>>Thanks,
> >>>
> >>>Dave
> >>>>
> >>>>
> >>>>>Dave
> >>>>>
> >>>>>Dr. David Alan Gilbert (1):
> >>>>> COLO: Add primary side rolling statistics
> >>>>>
> >>>>> hmp.c | 12 ++++++++++++
> >>>>> include/migration/migration.h | 3 +++
> >>>>> migration/colo.c | 15 +++++++++++++++
> >>>>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
> >>>>> qapi-schema.json | 11 ++++++++++-
> >>>>> 5 files changed, 70 insertions(+), 1 deletion(-)
> >>>>>
> >>>>
> >>>>
> >>>--
> >>>Dr. David Alan Gilbert / address@hidden / Manchester, UK
> >>>
> >>>.
> >>>
> >>
> >>
> >--
> >Dr. David Alan Gilbert / address@hidden / Manchester, UK
> >
> >.
> >
>
>
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK