[Qemu-devel] [Bug 1711602] Re: --copy-storage-all failing with qemu 2.10
From: ChristianEhrhardt
Subject: [Qemu-devel] [Bug 1711602] Re: --copy-storage-all failing with qemu 2.10
Date: Mon, 21 Aug 2017 14:55:54 -0000
As expected by David, when I trace process_incoming_migration_co (which prints
the "readable" error) I see the error pop up in "qemu_loadvm_state".
It appears as "Thread 4 "CPU 0/KVM" received signal SIGUSR1" and similar, which
is just the breakdown of the guest.
Diving into qemu_loadvm_state reveals that it gets as far as
"cpu_synchronize_all_pre_loadvm".
None of the initial checks in qemu_loadvm_state fail.
The "ret = vmstate_load_state(f, &vmstate_configuration, &savevm_state,
0);" seems to work fine as well.
The crash reproducibly happens in "cpu_synchronize_all_pre_loadvm".
I can catch the incoming qemu easily with:
$ while ! pid=$(pidof qemu-system-x86_64); do /bin/true; done; gdb --pid ${pid}
# Then, in gdb, break on "cpu_synchronize_all_pre_loadvm"
# When I step over it, the next thing I see is the beginning of the end
# for the process:
Thread 4 "CPU 0/KVM" received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x7f418136e700 (LWP 3887)]
__lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
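The busy-wait attach trick above can be generalized. Here is a stand-alone
sketch of it, using a throwaway "sleep" in place of qemu-system-x86_64 so it
runs anywhere, and gdb's -ex option to set the breakpoint non-interactively
(the gdb invocation at the end is commented out and only illustrative):

```shell
# Spin until the short-lived target process appears, then grab its PID.
# 'sleep' stands in for qemu-system-x86_64 here so the loop is runnable
# anywhere; in the real case you would hand ${pid} to gdb as shown below.
sleep 5 &
while ! pid=$(pidof sleep); do :; done
echo "caught pid ${pid}"
kill $!
# With the real target:
# gdb --pid "${pid}" -ex 'break cpu_synchronize_all_pre_loadvm' -ex continue
```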
The guest only has one vcpu, so CPU_FOREACH(cpu) is not much of a loop.
Looking down that path I tracked it along:
cpu_synchronize_all_pre_loadvm -> cpu_synchronize_pre_loadvm ->
kvm_cpu_synchronize_pre_loadvm -> do_run_on_cpu
Here it queues the function "do_kvm_cpu_synchronize_pre_loadvm" onto the vcpu.
That is done via "queue_work_on_cpu(cpu, &wi);", which in turn eventually uses
"qemu_cpu_kick_thread(cpu);".
That seems to trigger the first SIGUSR1.
Following that I get the breakpoint that I set at
"do_kvm_cpu_synchronize_pre_loadvm".
The actual function only sets "cpu->vcpu_dirty = true;" and works.
On the way out from there, there is a "qemu_kvm_wait_io_event" which
leads to the next SIGUSR1.
Going on I see another "do_run_on_cpu" with "vapic_do_enable_tpr_reporting" as
the function.
I set a breakpoint on that as well, but took a full CPUState dump before going on:
p *cpu
$4 = {parent_obj = {parent_obj = {class = 0x5ffe7170c0, free = 0x7f62328f15a0
<g_free>, properties = 0x5ffe736e40, ref = 1,
parent = 0x5ffe726160}, id = 0x0, realized = true, pending_deleted_event
= false, opts = 0x0, hotplugged = 0, parent_bus = 0x0, gpios = {
lh_first = 0x0}, child_bus = {lh_first = 0x0}, num_child_bus = 0,
instance_id_alias = -1, alias_required_for_version = 0}, nr_cores = 1,
nr_threads = 1, thread = 0x5ffe803cd0, thread_id = 8498, running = false,
has_waiter = false, halt_cond = 0x5ffe803cf0, thread_kicked = true,
created = true, stop = false, stopped = true, unplug = false, crash_occurred
= false, exit_request = false, interrupt_request = 4,
singlestep_enabled = 0, icount_budget = 0, icount_extra = 0, jmp_env =
{{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0,
__saved_mask = {__val = {0 <repeats 16 times>}}}}, work_mutex = {lock =
{__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
__kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next
= 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
initialized = true}, queued_work_first = 0x5fffefc990, queued_work_last =
0x5fffefc990, cpu_ases = 0x5ffe803c10, num_ases = 1,
as = 0x5ffe7f9690, memory = 0x5ffe725bd0, env_ptr = 0x5ffe7e44c0,
tb_jmp_cache = {0x0 <repeats 4096 times>}, gdb_regs = 0x0,
gdb_num_regs = 57, gdb_num_g_regs = 57, node = {tqe_next = 0x0, tqe_prev =
0x5ffc1783f0 <cpus>}, breakpoints = {tqh_first = 0x0,
tqh_last = 0x5ffe7e4430}, watchpoints = {tqh_first = 0x0, tqh_last =
0x5ffe7e4440}, watchpoint_hit = 0x0, opaque = 0x0, mem_io_pc = 0,
mem_io_vaddr = 0, kvm_fd = 19, kvm_state = 0x5ffe7357a0, kvm_run =
0x7f62374bc000, trace_dstate_delayed = {0}, trace_dstate = {0},
cpu_index = 0, halted = 1, can_do_io = 1, exception_index = -1, vcpu_dirty =
true, throttle_thread_scheduled = false, icount_decr = {u32 = 0,
u16 = {low = 0, high = 0}}, hax_vcpu = 0x0, pending_tlb_flush = 7}
Continuing, I hit the "vapic_do_enable_tpr_reporting" breakpoint with qemu
still running.
Things go on; the next candidate for "do_run_on_cpu" is "kvm_apic_put".
Another SIGUSR1 to get that kicked, it seems.
The "kvm_apic_put" breakpoint is reached after that kick.
Next up for "do_run_on_cpu" is "do_kvm_cpu_synchronize_post_init", and
that triggered the fourth SIGUSR1. Previously I only ever saw four;
hopefully that stays the same with so many breakpoints set.
Checked the cpu state again:
1880 static void do_kvm_cpu_synchronize_post_init(CPUState *cpu,
run_on_cpu_data arg)
1881 {
1882 kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
1883 cpu->vcpu_dirty = false;
1884 }
1885
(gdb) p cpu
$5 = (CPUState *) 0x5ffe7dc230
(gdb) p *cpu
$6 = {parent_obj = {parent_obj = {class = 0x5ffe7170c0, free = 0x7f62328f15a0
<g_free>, properties = 0x5ffe736e40, ref = 1,
parent = 0x5ffe726160}, id = 0x0, realized = true, pending_deleted_event
= false, opts = 0x0, hotplugged = 0, parent_bus = 0x0, gpios = {
lh_first = 0x0}, child_bus = {lh_first = 0x0}, num_child_bus = 0,
instance_id_alias = -1, alias_required_for_version = 0}, nr_cores = 1,
nr_threads = 1, thread = 0x5ffe803cd0, thread_id = 8498, running = false,
has_waiter = false, halt_cond = 0x5ffe803cf0, thread_kicked = false,
created = true, stop = false, stopped = true, unplug = false, crash_occurred
= false, exit_request = false, interrupt_request = 4,
singlestep_enabled = 0, icount_budget = 0, icount_extra = 0, jmp_env =
{{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0,
__saved_mask = {__val = {0 <repeats 16 times>}}}}, work_mutex = {lock =
{__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
__kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next
= 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
initialized = true}, queued_work_first = 0x0, queued_work_last = 0x0,
cpu_ases = 0x5ffe803c10, num_ases = 1, as = 0x5ffe7f9690,
memory = 0x5ffe725bd0, env_ptr = 0x5ffe7e44c0, tb_jmp_cache = {0x0 <repeats
4096 times>}, gdb_regs = 0x0, gdb_num_regs = 57,
gdb_num_g_regs = 57, node = {tqe_next = 0x0, tqe_prev = 0x5ffc1783f0 <cpus>},
breakpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4430},
watchpoints = {tqh_first = 0x0, tqh_last = 0x5ffe7e4440}, watchpoint_hit =
0x0, opaque = 0x0, mem_io_pc = 0, mem_io_vaddr = 0, kvm_fd = 19,
kvm_state = 0x5ffe7357a0, kvm_run = 0x7f62374bc000, trace_dstate_delayed =
{0}, trace_dstate = {0}, cpu_index = 0, halted = 1, can_do_io = 1,
exception_index = -1, vcpu_dirty = true, throttle_thread_scheduled = false,
icount_decr = {u32 = 0, u16 = {low = 0, high = 0}},
hax_vcpu = 0x0, pending_tlb_flush = 7}
And from here stepping into kvm_arch_put_registers:
kvm_arch_put_registers (address@hidden, address@hidden) at
./target/i386/kvm.c:2591
That is still the same vcpu as all along; unfortunately x86_cpu is optimized
out, as I had no full debug build with -O0.
I see it setting up regs in kvm_arch_put_registers without error (all ret=0)
and returning to do_kvm_cpu_synchronize_post_init.
This eventually sets "cpu->vcpu_dirty = false;".
Since all of this seemed fine, I stepped through the "way out", and there came
another "qemu_kvm_wait_io_event(cpu);".
Not considering this critical, I stepped over it with "n", and qemu was gone
with all its threads.
qemu_kvm_cpu_thread_fn (arg=0x5ffe7dc230) at ./cpus.c:1134
1134 } while (!cpu->unplug || cpu_can_run(cpu));
(gdb) n
1127 if (cpu_can_run(cpu)) {
(gdb) n
1133 qemu_kvm_wait_io_event(cpu);
(gdb) n
[Thread 0x7f6227857700 (LWP 8498) exited]
--
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1711602
Title:
--copy-storage-all failing with qemu 2.10
Status in QEMU:
New
Status in libvirt package in Ubuntu:
Confirmed
Status in qemu package in Ubuntu:
Confirmed
Bug description:
We fixed an issue around disk locking already in regard to qemu-nbd
[1], but there still seem to be issues.
$ virsh migrate --live --copy-storage-all kvmguest-artful-normal
qemu+ssh://10.22.69.196/system
error: internal error: qemu unexpectedly closed the monitor:
2017-08-18T12:10:29.800397Z qemu-system-x86_64: -chardev pty,id=charserial0:
char device redirected to /dev/pts/0 (label charserial0)
2017-08-18T12:10:48.545776Z qemu-system-x86_64: load of migration failed:
Input/output error
Source libvirt log for the guest:
2017-08-18 12:09:08.251+0000: initiating migration
2017-08-18T12:09:08.809023Z qemu-system-x86_64: Unable to read from socket:
Connection reset by peer
2017-08-18T12:09:08.809481Z qemu-system-x86_64: Unable to read from socket:
Connection reset by peer
Target libvirt log for the guest:
2017-08-18T12:09:08.730911Z qemu-system-x86_64: load of migration failed:
Input/output error
2017-08-18 12:09:09.010+0000: shutting down, reason=crashed
Given the timing, it seems the actual copy now works (it is busy ~10
seconds in my environment, which would be the copy).
Also, we don't see the old errors we saw before; instead it fails
afterwards, on the actual take-over.
Dmesg shows no related denials (apparmor is often in the mix with such issues).
Need to check the libvirt logs of source [2] and target [3] in detail.
[1]: https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg02200.html
[2]: http://paste.ubuntu.com/25339356/
[3]: http://paste.ubuntu.com/25339358/
To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1711602/+subscriptions