Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
From: Fei Li
Subject: Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Date: Fri, 26 Oct 2018 20:59:26 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
On 10/25/2018 08:55 PM, Dr. David Alan Gilbert wrote:
* Fei Li (address@hidden) wrote:
Hi,
these two patches are to fix live migration issues. The first is
about multifd, and the second is to fix some error handling.
But I have a question about using multifd migration.
In our current code, when multifd is used during migration, if there
is an error before the destination receives all new channels (I mean
multifd_recv_new_channel(ioc)), the destination does not exit but
keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
the source exits.
My question is about the state of the destination host if migration fails
during this period. I did a test: after applying the [1/2] patch, if
multifd_new_send_channel_async() fails, the destination host hangs for
a while and then pops up a window saying
"'QEMU (...) [stopped]' is not responding.
You may choose to wait a short while for it to continue or force
the application to quit entirely."
But after closing the window by clicking, the qemu on the dest still
hangs there until I explicitly kill the qemu on the source.
That sounds like the main thread is blocked for some reason?
Yes, the main thread on the dst keeps looping.
But I don't
normally use the window setup; if you try with -nographic and can see
the HMP (or a QMP) monitor, can you see if the monitor still responds?
Thanks for the `-nographic` reminder. I observed an interesting
phenomenon:
If I run `migrate -d tcp:ip_addr:port` before the guest's graphical display
appears (the screen is still dark), there is no hang and the guest starts up
properly later.
But if I do the live migration after the guest has fully started up, I mean
when I can operate something using my mouse inside the guest, the hang
occurs.
This is true when using `-nographic` for both src and dst,
and when using `-nographic` for only src or dst.
The symptom of the hang is that the dst never seems to respond (I
waited three minutes), and the cursor just keeps blinking. After I
explicitly kill the src, the dst quits, as follows:
(Same result if gdb is not used in src)
src:
(qemu) ...
(qemu) q
(gdb) q
dst:
(qemu) Up to now, dst has received the 0 channel
Up to now, dst has received the 1 channel
(qemu)
(qemu)
To check the migration state in the src:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off
release-ram: off block: off return-path: off pause-before-switchover:
off x-multifd: on dirty-bitmaps: off postcopy-blocktime: off
late-block-activate: off
Migration status: setup /* I added some code to set the status to
"failed", but it still does not work; see details below */
total time: 0 milliseconds
I guess the source should proactively notify the dst and disconnect
from its side, so I tried setting the above "Migration status" to
"failed" and calling qemu_fclose(s->to_dst_file) when
multifd_new_send_channel_async() fails.
(BTW: I even tried:
if (s->vm_was_running) { vm_start(); } )
But the hang still occurs.
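One possible explanation, which is only an assumption here and not something confirmed in this thread, is that the channel closed through s->to_dst_file and the channel the dst is blocked on are different sockets. The sketch below is plain POSIX C, not QEMU code (`peer_is_silent`, `close_one_of_two` and the main/extra naming are hypothetical). It shows that closing the sending end of one connection produces EOF only on that connection, while a reader on a second, still-open connection sees nothing.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if neither data nor EOF arrives on fd within timeout_ms,
 * 0 if a read would return immediately, -1 on poll() error. */
static int peer_is_silent(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret;

    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0) {
        return -1;
    }
    return ret == 0 ? 1 : 0;
}

/* Opens two independent connections and closes only the sending end of
 * the first one. Reports what the receiving end of each then observes.
 * Returns 0 on success, -1 if the sockets could not be created. */
static int close_one_of_two(int *main_silent, int *extra_silent)
{
    int main_ch[2], extra_ch[2];

    assert(main_silent != NULL && extra_silent != NULL);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, main_ch) < 0) {
        return -1;
    }
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, extra_ch) < 0) {
        close(main_ch[0]);
        close(main_ch[1]);
        return -1;
    }

    /* The "source" gives up on the first connection only. */
    close(main_ch[1]);

    *main_silent = peer_is_silent(main_ch[0], 50);   /* EOF is pending */
    *extra_silent = peer_is_silent(extra_ch[0], 50); /* still waiting */

    close(main_ch[0]);
    close(extra_ch[0]);
    close(extra_ch[1]);
    return 0;
}
```

If that assumption holds, the sender would have to close or shut down every connection the receiver has already accepted, not just the first one, before a blocked reader wakes up.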
If it doesn't then try and get a backtrace.
The monitor really shouldn't block, so it would be interesting to see.
Dave
I set two breakpoints and got the following backtrace; I hope it
helps. :)
Thread 1 "qemu-system-x86" hit Breakpoint 1, multifd_recv_new_channel (
ioc=0x555557995af0) at /build/gitcode/qemu-build/migration/ram.c:1368
1368 {
(gdb) c
Continuing.
Thread 1 "qemu-system-x86" hit Breakpoint 2, qio_channel_socket_readv (
ioc=0x555557995af0, iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0,
errp=0x7fffffffdb38) at io/channel-socket.c:463
463 {
(gdb) n
464 QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
(gdb)
......
483 retry:
(gdb)
484 ret = recvmsg(sioc->fd, &msg, sflags);
(gdb) bt
#0 qio_channel_socket_readv (ioc=0x555557995af0, iov=0x5555568777d0, niov=1,
    fds=0x0, nfds=0x0, errp=0x7fffffffdb38) at io/channel-socket.c:484
#1 0x0000555555d156c5 in qio_channel_readv_full (ioc=0x555557995af0,
    iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0, errp=0x7fffffffdb38)
    at io/channel.c:65
#2 0x0000555555d15b26 in qio_channel_readv (ioc=0x555557995af0,
    iov=0x5555568777d0, niov=1, errp=0x7fffffffdb38) at io/channel.c:197
#3 0x0000555555d15853 in qio_channel_readv_all_eof (ioc=0x555557995af0,
    iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:106
#4 0x0000555555d1595c in qio_channel_readv_all (ioc=0x555557995af0,
    iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:142
#5 0x0000555555d15d0c in qio_channel_read_all (ioc=0x555557995af0,
    buf=0x7fffffffdad0 "\340\"zVUU", buflen=25, errp=0x7fffffffdb38)
    at io/channel.c:246
#6 0x000055555587695c in multifd_recv_initial_packet (c=0x555557995af0,
    errp=0x7fffffffdb38) at /build/gitcode/qemu-build/migration/ram.c:653
#7 0x00005555558788fb in multifd_recv_new_channel (ioc=0x555557995af0)
    at /build/gitcode/qemu-build/migration/ram.c:1374
#8 0x0000555555bc9978 in migration_ioc_process_incoming (ioc=0x555557995af0)
    at migration/migration.c:573
#9 0x0000555555bd0c69 in migration_channel_process_incoming (ioc=0x555557995af0)
    at migration/channel.c:47
#10 0x0000555555bcf7e8 in socket_accept_incoming_migration (
    listener=0x5555578dcae0, cioc=0x555557995af0, opaque=0x0)
    at migration/socket.c:166
#11 0x0000555555d2051f in qio_net_listener_channel_func (ioc=0x5555579c7180,
    condition=G_IO_IN, opaque=0x5555578dcae0) at io/net-listener.c:53
#12 0x0000555555d1c0a2 in qio_channel_fd_source_dispatch (source=0x5555568d5970,
    callback=0x555555d20473 <qio_net_listener_channel_func>,
    user_data=0x5555578dcae0) at io/channel-watch.c:84
#13 0x00007ffff6353dc5 in g_main_context_dispatch ()
    from /usr/lib64/libglib-2.0.so.0
#14 0x0000555555d7d1ad in glib_pollfds_poll () at util/main-loop.c:215
#15 0x0000555555d7d227 in os_host_main_loop_wait (timeout=0)
    at util/main-loop.c:238
#16 0x0000555555d7d2e0 in main_loop_wait (nonblocking=0)
    at util/main-loop.c:497
#17 0x00005555559cd679 in main_loop () at vl.c:1884
#18 0x00005555559d4f1e in main (argc=32, argv=0x7fffffffe0b8,
    envp=0x7fffffffe1c0) at vl.c:4618
(gdb) n
Thread 1 "qemu-system-x86" received signal SIGINT, Interrupt.
0x00007ffff5606f64 in recvmsg () from /lib64/libpthread.so.0
(gdb) c
Continuing.
After I enter the `n` above, the dst just hangs here, apparently waiting
for recvmsg(sioc->fd, &msg, sflags) to return. Even after I press Ctrl+C
to interrupt it, the dst still hangs.
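To make the symptom concrete, here is a minimal standalone sketch in plain POSIX C. It is not QEMU code, and `read_would_block` and `read_all_or_eof` are hypothetical helper names invented for this illustration. It assumes the behaviour suggested by the backtrace: a "read all" style helper on a blocking socket stays in recv() for as long as the peer is connected but silent, and returns only when data arrives or the peer closes.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if a blocking read on fd would currently stall (neither data
 * nor EOF within timeout_ms), 0 if a read would return immediately,
 * -1 on poll() error. */
static int read_would_block(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret;

    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0) {
        return -1;
    }
    return ret == 0 ? 1 : 0;
}

/* Reads exactly len bytes from a blocking socket.
 * Returns 1 on success, 0 if the peer closed before len bytes arrived,
 * -1 on error. While the peer is connected but sends nothing, this does
 * not return; an interrupted call (EINTR) is simply retried. */
static int read_all_or_eof(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL);

    while (done < len) {
        ssize_t n = recv(fd, p + done, len - done, 0);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        if (n == 0) {
            return 0; /* peer closed the connection */
        }
        done += (size_t)n;
    }
    return 1;
}
```

On a socketpair(), `read_would_block()` reports 1 while the peer end is open and idle. Once the peer end is closed, `read_all_or_eof()` on a 25-byte buffer (the buflen seen in frame #5) returns 0. That would be consistent with the dst quitting only after the src is killed.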
Have a nice day, thanks
Fei
The source host keeps running as expected, but I guess the hang
in the dest is not right.
Would someone kindly give some suggestions on this? Thanks a lot.
Fei Li (2):
migration: fix the multifd code
migration: fix some error handling
migration/migration.c | 5 +----
migration/postcopy-ram.c | 3 +++
migration/ram.c | 33 +++++++++++++++++++++++----------
migration/ram.h | 2 +-
4 files changed, 28 insertions(+), 15 deletions(-)
--
2.13.7
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK