qemu-devel



From: Jason Wang
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
Date: Wed, 08 Apr 2015 16:08:35 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0


On 04/03/2015 05:20 PM, zhanghailiang wrote:
> On 2015/4/3 16:51, Jason Wang wrote:
>>
>>
>> On 04/02/2015 07:52 PM, zhanghailiang wrote:
>>> On 2015/4/1 3:06, Dr. David Alan Gilbert wrote:
>>>> * zhanghailiang (address@hidden) wrote:
>>>>> On 2015/3/30 15:59, Dr. David Alan Gilbert wrote:
>>>>>> * zhanghailiang (address@hidden) wrote:
>>>>>>> On 2015/3/27 18:18, Dr. David Alan Gilbert wrote:
>>>>>>>> * zhanghailiang (address@hidden) wrote:
>>>>>>>>> On 2015/3/26 11:52, Li Zhijian wrote:
>>>>>>>>>> On 03/26/2015 11:12 AM, Wen Congyang wrote:
>>>>>>>>>>> On 03/25/2015 05:50 PM, Juan Quintela wrote:
>>>>>>>>>>>> zhanghailiang<address@hidden>  wrote:
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We found that, sometimes, the content of the VM's memory is
>>>>>>>>>>>>> inconsistent between the source side and the destination side
>>>>>>>>>>>>> when we check it just after finishing migration but before
>>>>>>>>>>>>> the VM continues to run.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We used a patch like the one below to find this issue (you
>>>>>>>>>>>>> can find it in the attachment).
>>>>>>>>>>>>> Steps to reproduce:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) Compile QEMU:
>>>>>>>>>>>>>    ./configure --target-list=x86_64-softmmu
>>>>>>>>>>>>> --extra-ldflags="-lssl" && make
>>>>>>>>>>>>>
>>>>>>>>>>>>> (2) Command and output:
>>>>>>>>>>>>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>>>>>>>>>> qemu64,-kvmclock -netdev tap,id=hn0 -device
>>>>>>>>>>>>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
>>>>>>>>>>>>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
>>>>>>>>>>>>>
>>>>>>>>>>>>> -device
>>>>>>>>>>>>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>>>>>>>>>>>>>
>>>>>>>>>>>>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device
>>>>>>>>>>>>> usb-tablet -monitor stdio
>>>>>>>>>>>> Could you try to reproduce:
>>>>>>>>>>>> - without vhost
>>>>>>>>>>>> - without virtio-net
>>>>>>>>>>>> - cache=unsafe is going to give you trouble, but trouble
>>>>>>>>>>>>   should only happen after the migration of pages has finished.
>>>>>>>>>>> If I use an IDE disk, it doesn't happen.
>>>>>>>>>>> Even if I use virtio-net with vhost=on, it still doesn't
>>>>>>>>>>> happen. I guess
>>>>>>>>>>> it is because I migrate the guest when it is booting. The
>>>>>>>>>>> virtio net
>>>>>>>>>>> device is not used in this case.
>>>>>>>>>> Er?
>>>>>>>>>> It reproduces with my IDE disk.
>>>>>>>>>> There is no virtio device at all; my command line is like below:
>>>>>>>>>>
>>>>>>>>>> x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
>>>>>>>>>> qemu64,-kvmclock -net none
>>>>>>>>>> -boot c -drive file=/home/lizj/ubuntu.raw -vnc :7 -m 2048 -smp
>>>>>>>>>> 2 -machine
>>>>>>>>>> usb=off -no-user-config -nodefaults -monitor stdio -vga std
>>>>>>>>>>
>>>>>>>>>> It seems easy to reproduce this issue by the following steps in
>>>>>>>>>> an _ubuntu_ guest:
>>>>>>>>>> 1. on the source side, choose memtest in grub
>>>>>>>>>> 2. do live migration
>>>>>>>>>> 3. exit memtest (press Esc while the memory test is running)
>>>>>>>>>> 4. wait for migration to complete
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes, it is a thorny problem. It is indeed easy to reproduce,
>>>>>>>>> just as in your steps above.
>>>>>>>>>
>>>>>>>>> This is my test result (I also tested accel=tcg; it can be
>>>>>>>>> reproduced there as well):
>>>>>>>>> Source side:
>>>>>>>>> # x86_64-softmmu/qemu-system-x86_64 -machine
>>>>>>>>> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults
>>>>>>>>> -cpu qemu64,-kvmclock -boot c -drive
>>>>>>>>> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw
>>>>>>>>> -device cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2
>>>>>>>>> -monitor stdio
>>>>>>>>> (qemu) ACPI_BUILD: init ACPI tables
>>>>>>>>> ACPI_BUILD: init ACPI tables
>>>>>>>>> migrate tcp:9.61.1.8:3004
>>>>>>>>> ACPI_BUILD: init ACPI tables
>>>>>>>>> before cpu_synchronize_all_states
>>>>>>>>> 5a8f72d66732cac80d6a0d5713654c0e
>>>>>>>>> md_host : before saving ram complete
>>>>>>>>> 5a8f72d66732cac80d6a0d5713654c0e
>>>>>>>>> md_host : after saving ram complete
>>>>>>>>> 5a8f72d66732cac80d6a0d5713654c0e
>>>>>>>>> (qemu)
>>>>>>>>>
>>>>>>>>> Destination side:
>>>>>>>>> # x86_64-softmmu/qemu-system-x86_64 -machine
>>>>>>>>> pc-i440fx-2.3,accel=kvm,usb=off -no-user-config -nodefaults
>>>>>>>>> -cpu qemu64,-kvmclock -boot c -drive
>>>>>>>>> file=/mnt/sdb/pure_IMG/ubuntu/ubuntu_14.04_server_64_2U_raw
>>>>>>>>> -device cirrus-vga,id=video0,vgamem_mb=8 -vnc :7 -m 2048 -smp 2
>>>>>>>>> -monitor stdio -incoming tcp:0:3004
>>>>>>>>> (qemu) QEMU_VM_SECTION_END, after loading ram
>>>>>>>>> d7cb0d8a4bdd1557fb0e78baee50c986
>>>>>>>>> md_host : after loading all vmstate
>>>>>>>>> d7cb0d8a4bdd1557fb0e78baee50c986
>>>>>>>>> md_host : after cpu_synchronize_all_post_init
>>>>>>>>> d7cb0d8a4bdd1557fb0e78baee50c986
>>>>>>>>
>>>>>>>> Hmm, that's not good.  I suggest you md5 each of the RAMBlock's
>>>>>>>> individually;
>>>>>>>> to see if it's main RAM that's different or something more subtle
>>>>>>>> like
>>>>>>>> video RAM.
>>>>>>>>
>>>>>>>
>>>>>>> Er, all my previous tests md5'd only the 'pc.ram' block.
>>>>>>>
>>>>>>>> But then maybe it's easier just to dump the whole of RAM to file
>>>>>>>> and byte compare it (hexdump the two dumps and diff ?)
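A concrete way to do that comparison, as a sketch: the HMP `pmemsave` command dumps guest physical memory to a file, and the dumps can then be byte-compared. The 2 GB size and the file paths below are illustrative assumptions for this `-m 2048` guest, not values from the thread.

```shell
# In the HMP monitor on each side, dump guest RAM to a file
# (0x80000000 = 2 GB, matching -m 2048; address/size are illustrative):
#   (qemu) pmemsave 0 0x80000000 /tmp/ram-src.bin    # source, after save
#   (qemu) pmemsave 0 0x80000000 /tmp/ram-dst.bin    # destination, after load
#
# Demonstrate the comparison itself on two small stand-in files:
printf 'abcdef' > /tmp/ram-src.bin
printf 'abcXef' > /tmp/ram-dst.bin
# cmp -l prints each differing byte: 1-based offset, then both values in octal
cmp -l /tmp/ram-src.bin /tmp/ram-dst.bin || true
# Or, as suggested, hexdump both dumps and diff them:
hexdump -C /tmp/ram-src.bin > /tmp/ram-src.hex
hexdump -C /tmp/ram-dst.bin > /tmp/ram-dst.hex
diff /tmp/ram-src.hex /tmp/ram-dst.hex || true
```

The `cmp -l` offsets translate directly into guest-physical addresses, which makes it easy to see whether the corrupted pages cluster or look random.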
>>>>>>>
>>>>>>> Hmm, we also used the memcmp function to compare every page, but
>>>>>>> the differing addresses seem to be random.
>>>>>>>
>>>>>>> Besides, in our previous tests we found that it seems easier to
>>>>>>> reproduce when migration occurs during the VM's start-up or reboot
>>>>>>> process.
>>>>>>>
>>>>>>> Is it possible that some devices get special treatment during VM
>>>>>>> start-up which may miss setting the dirty bitmap?
>>>>>>
>>>>>> I don't think there should be, but the code paths used during
>>>>>> startup are probably much less tested with migration.  I'm sure the
>>>>>> startup code uses different parts of the device emulation.  I do
>>>>>> know we have some bugs.
>>>>>
>>>>> Er, maybe there is a special case:
>>>>>
>>>>> During the VM's start-up, I found that the KVMSlots changed many
>>>>> times; the total memory space was repeatedly smashed into smaller
>>>>> slots.
>>>>>
>>>>> If some pages were dirtied and their bits set in the dirty bitmap in
>>>>> the KVM module, but we did not sync that bitmap to QEMU user space
>>>>> before the slot was smashed, the previous bitmap is destroyed along
>>>>> with the slot, and the dirty pages recorded in the old KVMSlot's
>>>>> bitmap may be missed.
>>>>>
>>>>> What's your opinion? Can the situation I described above happen?
>>>>>
>>>>> The log below was grabbed while I tried to figure out a quite
>>>>> similar problem (some pages missing their dirty-bitmap setting) that
>>>>> we found in COLO.
>>>>> Occasionally, there is an error report on the SLAVE side:
>>>>>
>>>>>       qemu: warning: error while loading state for instance 0x0 of
>>>>>       device 'kvm-tpr-opt'
>>>>>       qemu-system-x86_64: loadvm failed
>>>>>
>>>>> We found that it is related to three addresses (gpa: 0xca000,
>>>>> 0xcb000, 0xcc000, which are the addresses of 'kvmvapic.rom'?), and
>>>>> sometimes their corresponding dirty bitmap is missed on the master
>>>>> side, because their KVMSlot is destroyed before we sync its dirty
>>>>> bitmap to QEMU.
>>>>>
>>>>> (I'm still not quite sure whether this can also happen in common
>>>>> migration; I will try to test it in normal migration.)
>>>>
>>> Hi,
>>>
>>> We have found two places that miss setting the migration bitmap of
>>> dirty pages.
>>> The virtio-blk related one can be fixed by Wen Congyang's patch; you
>>> can find his reply on the list.
>>> The 'kvm-tpr-opt' related one can be fixed by the following patch.
>>>
>>> Thanks,
>>> zhang
>>>
>>> From 0c63687d0f14f928d6eb4903022a7981db6ba59f Mon Sep 17 00:00:00 2001
>>> From: zhanghailiang <address@hidden>
>>> Date: Thu, 2 Apr 2015 19:26:31 +0000
>>> Subject: [PATCH] kvm-all: Sync dirty-bitmap from kvm before kvm
>>> destroys the
>>>   corresponding dirty_bitmap
>>>
>>> Sometimes we destroy the dirty_bitmap in a kvm_memory_slot before any
>>> sync action occurs; the bits in that dirty_bitmap are lost, which
>>> leads to the corresponding dirty pages being missed in migration.
>>>
>>> This usually happens when migrating during the VM's start-up or
>>> reboot.
>>>
>>> Signed-off-by: zhanghailiang <address@hidden>
>>> ---
>>>   exec.c    | 2 +-
>>>   kvm-all.c | 4 +++-
>>>   2 files changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/exec.c b/exec.c
>>> index 874ecfc..4b1b39b 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -59,7 +59,7 @@
>>>   //#define DEBUG_SUBPAGE
>>>
>>>   #if !defined(CONFIG_USER_ONLY)
>>> -static bool in_migration;
>>> +bool in_migration;
>>>
>>>   /* ram_list is read under rcu_read_lock()/rcu_read_unlock().  Writes
>>>    * are protected by the ramlist lock.
>>> diff --git a/kvm-all.c b/kvm-all.c
>>> index 335438a..dd75eff 100644
>>> --- a/kvm-all.c
>>> +++ b/kvm-all.c
>>> @@ -128,6 +128,8 @@ bool kvm_allowed;
>>>   bool kvm_readonly_mem_allowed;
>>>   bool kvm_vm_attributes_allowed;
>>>
>>> +extern bool in_migration;
>>> +
>>>   static const KVMCapabilityInfo kvm_required_capabilites[] = {
>>>       KVM_CAP_INFO(USER_MEMORY),
>>>       KVM_CAP_INFO(DESTROY_MEMORY_REGION_WORKS),
>>> @@ -715,7 +717,7 @@ static void kvm_set_phys_mem(MemoryRegionSection
>>> *section, bool add)
>>>
>>>           old = *mem;
>>>
>>> -        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
>>> +        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES || in_migration) {
>>>               kvm_physical_sync_dirty_bitmap(section);
>>>           }
>>>
>>> -- 
>>
>> I can still see an XFS panic complaining "Corruption of in-memory data
>> detected." in the guest after migration, even with this patch and an
>> IDE disk.
>>
>
> What's your qemu command line?
>
> Thanks,
> zhanghailiang
>
>

Really a simple cli:

$qemu_path $img_path -m 2G -enable-kvm -vga std -cpu host -netdev
tap,id=hn0 -device virtio-net-pci,netdev=hn0

I reproduce the issue by scp-ing a file to the guest during migration.
After 20 or more migrations, the guest is stuck or panics. The issue can
also be reproduced when using e1000 or even without any card (in that
case, the guest gets stuck at login).

The issue seems related to the host kernel: when I switch to RHEL7 as
the host kernel, I can't reproduce the issue.


