From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
Date: Wed, 25 Mar 2015 09:46:40 +0000
User-agent: Mutt/1.5.23 (2014-03-12)

* zhanghailiang (address@hidden) wrote:
> Hi all,
>
> We found that, sometimes, the content of the VM's memory is inconsistent between
> the Source side and the Destination side
> when we check it just after migration finishes but before the VM continues to run.
>
> We used a patch like the one below to find this issue (it is also attached),
> and these are the steps to reproduce:
>
> (1) Compile QEMU:
> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
>
> (2) Command and output:
> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock
> -netdev tap,id=hn0 -device virtio-net-pci,id=net-pci0,netdev=hn0 -boot c
> -drive
> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> -device
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor
> stdio
> (qemu) migrate tcp:192.168.3.8:3004
> before saving ram complete
> ff703f6889ab8701e4e040872d079a28
> md_host : after saving ram complete
> ff703f6889ab8701e4e040872d079a28
>
> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock
> -netdev tap,id=hn0,vhost=on -device virtio-net-pci,id=net-pci0,netdev=hn0
> -boot c -drive
> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> -device
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor
> stdio -incoming tcp:0:3004
> (qemu) QEMU_VM_SECTION_END, after loading ram
> 230e1e68ece9cd4e769630e1bcb5ddfb
> md_host : after loading all vmstate
> 230e1e68ece9cd4e769630e1bcb5ddfb
> md_host : after cpu_synchronize_all_post_init
> 230e1e68ece9cd4e769630e1bcb5ddfb
>
> This happens occasionally, and it is easier to reproduce when the
> migration command is issued during the VM's startup.
>
> We have done further tests and found that some pages have been dirtied but their
> corresponding bits in migration_bitmap are not set.
> We can't figure out which module of QEMU fails to set the bitmap when
> it dirties a page of the VM;
> it is very difficult for us to trace all the actions that dirty the VM's pages.
>
> Actually, the first time we found this problem was during COLO FT
> development, where it triggered some strange issues in the
> VM which all pointed to inconsistent VM memory. (We have
> tried saving all of the VM's memory to the slave side on every
> COLO FT checkpoint, and then everything is OK.)
>
> Is it OK for some pages not to be transferred to the destination during migration?
> Or is it a bug?
That does sound like a bug.
The only other explanation I have is that memory is being changed by a device
emulation that happens after the end of saving the VM, or after loading the
memory. That's certainly possible - especially if a device (say networking)
hasn't been properly stopped.
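
One way to narrow down where such a late write lands is to checksum per page
rather than over the whole block, snapshot the checksums at one point and diff
them at a later one. The following is only a sketch under assumptions: the
4 KiB page size, the FNV-1a hash (used instead of MD5 so it has no library
dependency) and the names page_hash, snapshot_pages and changed_pages are all
made up for illustration and are not QEMU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* FNV-1a over one page; a dependency-free stand-in for MD5. */
static uint64_t page_hash(const uint8_t *page)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++) {
        h ^= page[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Record one hash per page of [ram, ram + len) into out[]. */
static void snapshot_pages(const uint8_t *ram, size_t len, uint64_t *out)
{
    assert(len % SKETCH_PAGE_SIZE == 0);
    for (size_t p = 0; p < len / SKETCH_PAGE_SIZE; p++) {
        out[p] = page_hash(ram + p * SKETCH_PAGE_SIZE);
    }
}

/*
 * Compare the current contents against an earlier snapshot.
 * Stores up to max changed page indices in changed[] and returns
 * the total number of pages whose hash differs.
 */
static size_t changed_pages(const uint8_t *ram, size_t len,
                            const uint64_t *snap,
                            size_t *changed, size_t max)
{
    size_t n = 0;

    assert(len % SKETCH_PAGE_SIZE == 0);
    for (size_t p = 0; p < len / SKETCH_PAGE_SIZE; p++) {
        if (page_hash(ram + p * SKETCH_PAGE_SIZE) != snap[p]) {
            if (n < max) {
                changed[n] = p;
            }
            n++;
        }
    }
    return n;
}
```

Taking the snapshot where the patch prints "before saving" and diffing where
it prints "after saving" would give page indices that can then be checked
against the migration bitmap.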
> This issue has blocked our COLO development... :(
>
> Any help will be greatly appreciated!
I suggest:
1) Does it happen with devices other than virtio?
2) Strip the devices down - e.g. just run with serial and no video/usb
3) Try doing the md5 comparison at the end of ram_save_complete
4) mprotect RAM after the ram_save_complete and see if anything faults.
5) Can you trigger this with normal migration or just COLO?
I'm wondering if something is doing something on a running/paused/etc state
change and isn't expecting the new COLO states.
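
For (4), a rough sketch of the idea, assuming Linux and a page-aligned
region: make the region read-only and catch the SIGSEGV raised by the first
write. The names guard_region and on_segv and the globals are hypothetical,
not anything in QEMU, and the sketch only demonstrates catching a write made
by user-space code in the same process.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS in callers on glibc */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t *guard_base;                 /* start of the guarded region */
static size_t guard_len;                    /* its length in bytes */
static void *volatile last_fault_addr;      /* address of the last caught write */
static volatile sig_atomic_t fault_count;   /* number of caught writes */

/* SIGSEGV handler: record a write into the guarded region, then unprotect it. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    uint8_t *addr = info->si_addr;

    (void)sig;
    (void)ctx;
    if (addr < guard_base || addr >= guard_base + guard_len) {
        /* Not our region: restore the default action so the retry is fatal. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    last_fault_addr = addr;
    fault_count++;
    /*
     * Re-enable writes so the faulting instruction can be retried.
     * mprotect() is not on the POSIX async-signal-safe list; calling it
     * here is a common debugging trick, not a guaranteed-portable one.
     */
    mprotect(guard_base, guard_len, PROT_READ | PROT_WRITE);
}

/* Make [base, base + len) read-only and install the handler. 0 on success. */
static int guard_region(void *base, size_t len)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) < 0) {
        return -1;
    }
    guard_base = base;
    guard_len = len;
    return mprotect(base, len, PROT_READ);
}
```

For actual debugging it is probably more useful to call abort() in the
handler instead of unprotecting, so that the core dump or debugger shows the
backtrace of whoever did the write.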
Dave
>
> Thanks,
> zhanghailiang
>
> --- a/savevm.c
> +++ b/savevm.c
> @@ -51,6 +51,26 @@
> #define ARP_PTYPE_IP 0x0800
> #define ARP_OP_REQUEST_REV 0x3
>
> +#include "qemu/rcu_queue.h"
> +#include <openssl/md5.h>
> +
> +static void check_host_md5(void)
> +{
> + int i;
> + unsigned char md[MD5_DIGEST_LENGTH];
> + MD5_CTX ctx;
> + RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks); /* Only check 'pc.ram' block */
> +
> + MD5_Init(&ctx);
> + MD5_Update(&ctx, (void *)block->host, block->used_length);
> + MD5_Final(md, &ctx);
> + printf("md_host : ");
> + for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
> + fprintf(stderr, "%02x", md[i]);
> + }
> + fprintf(stderr, "\n");
> +}
> +
> static int announce_self_create(uint8_t *buf,
> uint8_t *mac_addr)
> {
> @@ -741,7 +761,13 @@ void qemu_savevm_state_complete(QEMUFile *f)
> qemu_put_byte(f, QEMU_VM_SECTION_END);
> qemu_put_be32(f, se->section_id);
>
> + printf("before saving %s complete\n", se->idstr);
> + check_host_md5();
> +
> ret = se->ops->save_live_complete(f, se->opaque);
> + printf("after saving %s complete\n", se->idstr);
> + check_host_md5();
> +
> trace_savevm_section_end(se->idstr, se->section_id, ret);
> if (ret < 0) {
> qemu_file_set_error(f, ret);
> @@ -1030,6 +1063,11 @@ int qemu_loadvm_state(QEMUFile *f)
> }
>
> ret = vmstate_load(f, le->se, le->version_id);
> + if (section_type == QEMU_VM_SECTION_END) {
> + printf("QEMU_VM_SECTION_END, after loading %s\n", le->se->idstr);
> + check_host_md5();
> + }
> +
> if (ret < 0) {
> error_report("error while loading state section id %d(%s)",
> section_id, le->se->idstr);
> @@ -1061,7 +1099,11 @@ int qemu_loadvm_state(QEMUFile *f)
> g_free(buf);
> }
>
> + printf("after loading all vmstate\n");
> + check_host_md5();
> cpu_synchronize_all_post_init();
> + printf("after cpu_synchronize_all_post_init\n");
> + check_host_md5();
>
> ret = 0;
>
> --
> From ecb789cf7f383b112da3cce33eb9822a94b9497a Mon Sep 17 00:00:00 2001
> From: Li Zhijian <address@hidden>
> Date: Tue, 24 Mar 2015 21:53:26 -0400
> Subject: [PATCH] check pc.ram block md5sum between migration Source and
> Destination
>
> Signed-off-by: Li Zhijian <address@hidden>
> ---
> savevm.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 42 insertions(+)
> mode change 100644 => 100755 savevm.c
>
> diff --git a/savevm.c b/savevm.c
> old mode 100644
> new mode 100755
> index 3b0e222..3d431dc
> --- a/savevm.c
> +++ b/savevm.c
> @@ -51,6 +51,26 @@
> #define ARP_PTYPE_IP 0x0800
> #define ARP_OP_REQUEST_REV 0x3
>
> +#include "qemu/rcu_queue.h"
> +#include <openssl/md5.h>
> +
> +static void check_host_md5(void)
> +{
> + int i;
> + unsigned char md[MD5_DIGEST_LENGTH];
> + MD5_CTX ctx;
> + RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks); /* Only check 'pc.ram' block */
> +
> + MD5_Init(&ctx);
> + MD5_Update(&ctx, (void *)block->host, block->used_length);
> + MD5_Final(md, &ctx);
> + printf("md_host : ");
> + for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
> + fprintf(stderr, "%02x", md[i]);
> + }
> + fprintf(stderr, "\n");
> +}
> +
> static int announce_self_create(uint8_t *buf,
> uint8_t *mac_addr)
> {
> @@ -741,7 +761,13 @@ void qemu_savevm_state_complete(QEMUFile *f)
> qemu_put_byte(f, QEMU_VM_SECTION_END);
> qemu_put_be32(f, se->section_id);
>
> + printf("before saving %s complete\n", se->idstr);
> + check_host_md5();
> +
> ret = se->ops->save_live_complete(f, se->opaque);
> + printf("after saving %s complete\n", se->idstr);
> + check_host_md5();
> +
> trace_savevm_section_end(se->idstr, se->section_id, ret);
> if (ret < 0) {
> qemu_file_set_error(f, ret);
> @@ -1007,6 +1033,13 @@ int qemu_loadvm_state(QEMUFile *f)
> QLIST_INSERT_HEAD(&loadvm_handlers, le, entry);
>
> ret = vmstate_load(f, le->se, le->version_id);
> +#if 0
> + if (section_type == QEMU_VM_SECTION_FULL) {
> + printf("QEMU_VM_SECTION_FULL, after loading %s\n", le->se->idstr);
> + check_host_md5();
> + }
> +#endif
> +
> if (ret < 0) {
> error_report("error while loading state for instance 0x%x of"
> " device '%s'", instance_id, idstr);
> @@ -1030,6 +1063,11 @@ int qemu_loadvm_state(QEMUFile *f)
> }
>
> ret = vmstate_load(f, le->se, le->version_id);
> + if (section_type == QEMU_VM_SECTION_END) {
> + printf("QEMU_VM_SECTION_END, after loading %s\n", le->se->idstr);
> + check_host_md5();
> + }
> +
> if (ret < 0) {
> error_report("error while loading state section id %d(%s)",
> section_id, le->se->idstr);
> @@ -1061,7 +1099,11 @@ int qemu_loadvm_state(QEMUFile *f)
> g_free(buf);
> }
>
> + printf("after loading all vmstate\n");
> + check_host_md5();
> cpu_synchronize_all_post_init();
> + printf("after cpu_synchronize_all_post_init\n");
> + check_host_md5();
>
> ret = 0;
>
> --
> 1.7.12.4
>
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK