From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
Date: Fri, 27 Mar 2015 10:18:32 +0000
User-agent: Mutt/1.5.23 (2014-03-12)

* zhanghailiang (address@hidden) wrote:
> On 2015/3/26 11:52, Li Zhijian wrote:
> >On 03/26/2015 11:12 AM, Wen Congyang wrote:
> >>On 03/25/2015 05:50 PM, Juan Quintela wrote:
> >>>zhanghailiang<address@hidden>  wrote:
> >>>>Hi all,
> >>>>
> >>>>We found that, sometimes, the content of the VM's memory is inconsistent 
> >>>>between the Source side and the Destination side
> >>>>when we check it just after finishing migration but before the VM 
> >>>>continues to run.
> >>>>
> >>>>We used a patch like the one below to find this issue (you can find it 
> >>>>in the attachment). Steps to reproduce:
> >>>>
> >>>>(1) Compile QEMU:
> >>>>  ./configure --target-list=x86_64-softmmu  --extra-ldflags="-lssl" && 
> >>>> make
> >>>>
> >>>>(2) Command and output:
> >>>>SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu 
> >>>>qemu64,-kvmclock -netdev tap,id=hn0 -device 
> >>>>virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive 
> >>>>file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>>> -device 
> >>>>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet 
> >>>>-monitor stdio
> >>>Could you try to reproduce:
> >>>- without vhost
> >>>- without virtio-net
> >>>- cache=unsafe is going to give you trouble, but trouble should only
> >>>   happen after the migration of pages has finished.
> >>If I use an IDE disk, it doesn't happen.
> >>Even if I use virtio-net with vhost=on, it still doesn't happen. I guess
> >>that is because I migrate the guest while it is booting. The virtio net
> >>device is not used in this case.
> >Er, really?
> >It reproduces with my IDE disk;
> >there is no virtio device at all. My command line is like below:
> >
> >x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -net none
> >-boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp 2 -machine
> >usb=off -no-user-config -nodefaults -monitor stdio -vga std
> >
> >it seems easy to reproduce this issue with the following steps in an _ubuntu_ guest:
> >1. on the source side, choose memtest in grub
> >2. start live migration
> >3. exit memtest (press Esc while memory is being tested)
> >4. wait for migration to complete
> >
> 
> Yes, it is a thorny problem. It is indeed easy to reproduce, just by
> following your steps above.
> 
> This is my test result (I also tested accel=tcg; it can be reproduced there as well):
> Source side:
> # x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.3,accel=kvm,usb=off 
> -no-user-config -nodefaults  -cpu qemu64,-kvmclock -boot c -drive 
> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device 
> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio
> (qemu) ACPI_BUILD: init ACPI tables
> ACPI_BUILD: init ACPI tables
> migrate tcp:9.61.1.8:3004
> ACPI_BUILD: init ACPI tables
> before cpu_synchronize_all_states
> 5a8f72d66732cac80d6a0d5713654c0e
> md_host : before saving ram complete
> 5a8f72d66732cac80d6a0d5713654c0e
> md_host : after saving ram complete
> 5a8f72d66732cac80d6a0d5713654c0e
> (qemu)
>
> Destination side:
> # x86_64-softmmu/qemu-system-x86_64 -machine pc-i440fx-2.3,accel=kvm,usb=off 
> -no-user-config -nodefaults  -cpu qemu64,-kvmclock -boot c -drive 
> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw -device 
> cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2 -monitor stdio 
> -incoming tcp:0:3004
> (qemu) QEMU_VM_SECTION_END, after loading ram
> d7cb0d8a4bdd1557fb0e78baee50c986
> md_host : after loading all vmstate
> d7cb0d8a4bdd1557fb0e78baee50c986
> md_host : after cpu_synchronize_all_post_init
> d7cb0d8a4bdd1557fb0e78baee50c986

Hmm, that's not good.  I suggest you md5 each of the RAMBlocks individually,
to see if it's main RAM that's different or something more subtle like
video RAM.
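
For illustration, a rough sketch of the kind of per-RAMBlock hash Dave
suggests (hypothetical, not an actual QEMU patch; it assumes the 2015-era
ram_list/RAMBlock fields -- block->idstr, block->host, block->used_length --
and OpenSSL's MD5(), which the reporter already links with -lssl):

#include <stdio.h>
#include <openssl/md5.h>

/* Hypothetical debug helper: print an MD5 digest per RAMBlock so a
 * mismatch can be narrowed down to one region (main RAM, VGA RAM, ...).
 * Field names and the iteration macro may differ between QEMU versions;
 * this follows the QEMU 2.3-era layout. */
static void md5_each_ramblock(const char *tag)
{
    RAMBlock *block;

    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
        unsigned char digest[MD5_DIGEST_LENGTH];
        char hex[2 * MD5_DIGEST_LENGTH + 1];
        int i;

        /* Hash the host mapping of this block's used range. */
        MD5(block->host, block->used_length, digest);
        for (i = 0; i < MD5_DIGEST_LENGTH; i++) {
            snprintf(hex + 2 * i, 3, "%02x", digest[i]);
        }
        fprintf(stderr, "%s: %s (%zu bytes): %s\n",
                tag, block->idstr, (size_t)block->used_length, hex);
    }
}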

But then maybe it's easier just to dump the whole of RAM to a file
and byte-compare it (hexdump the two dumps and diff?)
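
One hypothetical way to do that: dump guest RAM on both sides with the
monitor's existing pmemsave command (e.g. "pmemsave 0 0x80000000
/tmp/ram.bin" for this 2048 MB guest) and walk the two dumps page by page.
A minimal standalone comparator, assuming 4 KiB pages and equal-sized dumps:

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Compare two raw RAM dumps page by page and report differing pages.
 * Usage: ramdiff src-ram.bin dst-ram.bin */
int main(int argc, char **argv)
{
    unsigned char a[PAGE_SIZE], b[PAGE_SIZE];
    unsigned long page = 0, diffs = 0;
    size_t ra, rb;
    FILE *fa, *fb;

    if (argc != 3) {
        fprintf(stderr, "usage: %s src.bin dst.bin\n", argv[0]);
        return 1;
    }
    fa = fopen(argv[1], "rb");
    fb = fopen(argv[2], "rb");
    if (!fa || !fb) {
        perror("fopen");
        return 1;
    }
    for (;;) {
        ra = fread(a, 1, PAGE_SIZE, fa);
        rb = fread(b, 1, PAGE_SIZE, fb);
        if (ra == 0 || rb == 0 || ra != rb) {
            break;              /* EOF or size mismatch */
        }
        if (memcmp(a, b, ra) != 0) {
            printf("page %lu (offset 0x%lx) differs\n",
                   page, page * (unsigned long)PAGE_SIZE);
            diffs++;
        }
        page++;
    }
    printf("%lu differing page(s) out of %lu compared\n", diffs, page);
    fclose(fa);
    fclose(fb);
    return diffs ? 2 : 0;
}

Dividing a differing offset by the page size identifies the guest page,
which can then be checked against the migration bitmap.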

Dave

> 
> 
> Thanks,
> zhang
> 
> >>
> >>>What kind of load were you having when reproducing this issue?
> >>>Just to confirm, you have been able to reproduce this without COLO
> >>>patches, right?
> >>>
> >>>>(qemu) migrate tcp:192.168.3.8:3004
> >>>>before saving ram complete
> >>>>ff703f6889ab8701e4e040872d079a28
> >>>>md_host : after saving ram complete
> >>>>ff703f6889ab8701e4e040872d079a28
> >>>>
> >>>>DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu 
> >>>>qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device 
> >>>>virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive 
> >>>>file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>>> -device 
> >>>>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet 
> >>>>-monitor stdio -incoming tcp:0:3004
> >>>>(qemu) QEMU_VM_SECTION_END, after loading ram
> >>>>230e1e68ece9cd4e769630e1bcb5ddfb
> >>>>md_host : after loading all vmstate
> >>>>230e1e68ece9cd4e769630e1bcb5ddfb
> >>>>md_host : after cpu_synchronize_all_post_init
> >>>>230e1e68ece9cd4e769630e1bcb5ddfb
> >>>>
> >>>>This happens occasionally, and it is easier to reproduce when issuing 
> >>>>the migration command during the VM's startup.
> >>>OK, a couple of things.  Memory doesn't have to be exactly identical.
> >>>Virtio devices in particular do funny things on "post-load".  There
> >>>are no guarantees for that as far as I know; we should end up with an
> >>>equivalent device state in memory.
> >>>
> >>>>We have done further testing and found that some pages have been dirtied 
> >>>>but their corresponding migration_bitmap bits are not set.
> >>>>We can't figure out which module of QEMU missed setting the bitmap when 
> >>>>dirtying the VM's pages;
> >>>>it is very difficult for us to trace all the actions that dirty the VM's 
> >>>>pages.
> >>>This seems to point to a bug in one of the devices.
> >>>
> >>>>Actually, the first time we found this problem was during the COLO FT 
> >>>>development, and it triggered some strange issues in the
> >>>>VM which all pointed to inconsistency of the VM's memory. (We 
> >>>>have tried saving all of the VM's memory to the slave side every time
> >>>>we do a checkpoint in COLO FT, and then everything is OK.)
> >>>>
> >>>>Is it OK for some pages not to be transferred to the destination during 
> >>>>migration? Or is it a bug?
> >>>Pages transferred should be the same; after device state transmission is
> >>>when things could change.
> >>>
> >>>>This issue has blocked our COLO development... :(
> >>>>
> >>>>Any help will be greatly appreciated!
> >>>Later, Juan.
> >>>
> >>.
> >>
> >
> >
> 
> 
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


