[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and
From: |
Dan Streetman |
Subject: |
[Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu |
Date: |
Mon, 15 Apr 2019 19:26:22 -0000 |
** Description changed:
[impact]
on shutdown of a guest, there is a race condition that results in qemu
crashing instead of normally shutting down. The bt looks similar to
this (depending on the specific version of qemu, of course; this is
taken from 2.5 version of qemu):
(gdb) bt
#0 __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
#1 0x00005636c0bc4389 in qemu_mutex_lock (address@hidden) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
#2 0x00005636c0988130 in qemu_chr_fe_write_all (address@hidden,
address@hidden "\v", address@hidden) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
#3 0x00005636c08f3483 in vhost_user_write (address@hidden, address@hidden,
address@hidden, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
#4 0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70,
ring=0x7ffe65c087e0) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
#5 0x00005636c08efff0 in vhost_virtqueue_stop (address@hidden,
address@hidden, vq=0x5636c1bf6d00, idx=1) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
#6 0x00005636c08f2944 in vhost_dev_stop (address@hidden, address@hidden) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
#7 0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70,
address@hidden) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
#8 0x00005636c08dbe5b in vhost_net_stop (address@hidden, ncs=0x5636c209d110,
address@hidden) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
#9 0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a',
n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
#10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
#11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338,
val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
#12 0x00005636c098fed2 in vm_state_notify (address@hidden, address@hidden) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
#13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
#14 vm_stop (state=RUN_STATE_SHUTDOWN) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
#15 0x00005636c085d240 in main_loop_should_exit () at
/build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
#16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
#17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683
[test case]
unfortunately since this is a race condition, it's very hard to
arbitrarily reproduce; it depends very much on the overall configuration
of the guest as well as how exactly it's shut down - specifically, its
vhost user net must be closed from the host side at a specific time
during qemu shutdown.
I have someone with such a setup who has reported to me their setup is
able to reproduce this reliably, but the config is too complex for me to
reproduce so I have relied on their reproduction and testing to debug
and craft the patch for this.
[regression potential]
- the change adds flags to prevent repeated calls to both vhost_net_stop()
- and vhost_net_cleanup() (really, prevents repeated calls to
- vhost_dev_cleanup(), but vhost_net_cleanup() does nothing else). Any
- regression would be seen when stopping and/or cleaning up a vhost net.
- Regressions might include failure to hot-remove a vhost net from a
- guest, or failure to cleanup (i.e. mem leak), or crashes during cleanup
- or stopping a vhost net.
-
- However, the flags are very unintrusive, and only in the shutdown path
- (of a vhost_dev or vhost_net), and are unlikely to cause any
- regressions.
+ the change adds a flag to prevent repeated calls to vhost_net_stop().
+ This also prevents any calls to vhost_net_cleanup() from
+ net_vhost_user_event(). Any regression would be seen when stopping
+ and/or cleaning up a vhost net. Regressions might include failure to
+ hot-remove a vhost net from a guest, or failure to cleanup (i.e. mem
+ leak), or crashes during cleanup or stopping a vhost net.
[other info]
this was originally seen in the 2.5 version of qemu - specifically, the
UCA version in trusty-mitaka (which uses the xenial qemu codebase).
However, this appears to still apply upstream, and I am sending a patch
to the qemu list to patch upstream as well.
The specific race condition for this (in the qemu 2.5 code version) is:
as shown in above bt, thread A starts shutting down qemu, e.g.:
vm_stop->do_vm_stop->vm_state_notify
virtio_set_status
virtio_net_set_status
virtio_net_vhost_status
in this function, code gets to an if-else check for (!n->vhost_started),
which is false (i.e. vhost_started is true) and enters the else code
block, which calls vhost_net_stop() and then sets n->vhost_started to
false.
While thread A is inside vhost_net_stop(), thread B is triggered by
the vhost net chr handler with a user event and calls:
net_vhost_user_event
qmp_set_link (from case CHR_EVENT_CLOSED)
virtio_net_set_link_status (via ->link_status_changed)
virtio_net_set_status
virtio_net_vhost_status
notice thread B has now reached the same function that thread A is in;
since the checks in the function have not changed, thread B follows the
same path that thread A followed, and enters vhost_net_stop().
Since thread A has already shut down and cleaned up some of the
internals, once thread B starts trying to also clean up things, it
segfaults as the shown in the bt.
Avoiding only this duplicate call to vhost_net_stop() is required, but
not enough - let's continue to look at what thread B does after its call
to qmp_set_link() returns:
net_vhost_user_event
vhost_user_stop
vhost_net_cleanup
vhost_dev_cleanup
However, in main() qemu registers atexit(net_cleanup()), which does:
net_cleanup
qemu_del_nic (or qemu_del_net_client, depending on ->type)
qemu_cleanup_net_client
vhost_user_cleanup (via ->cleanup)
vhost_net_cleanup
vhost_dev_cleanup
and the duplicate vhost_dev_cleanup fails assertions since things were
- already cleaned up.
+ already cleaned up. Additionally, if thread B's call to
+ vhost_dev_cleanup() comes before thread A finishes vhost_net_stop(),
+ then that will call vhost_dev_stop() and vhost_disable_notifiers() which
+ both try to access things that have been freed/cleared/disabled by
+ vhost_dev_cleanup().
--
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1823458
Title:
race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown
crashes qemu
Status in QEMU:
In Progress
Status in qemu package in Ubuntu:
In Progress
Status in qemu source package in Trusty:
In Progress
Status in qemu source package in Xenial:
In Progress
Status in qemu source package in Bionic:
In Progress
Status in qemu source package in Cosmic:
In Progress
Status in qemu source package in Disco:
In Progress
Bug description:
[impact]
on shutdown of a guest, there is a race condition that results in qemu
crashing instead of normally shutting down. The bt looks similar to
this (depending on the specific version of qemu, of course; this is
taken from 2.5 version of qemu):
(gdb) bt
#0 __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
#1 0x00005636c0bc4389 in qemu_mutex_lock (address@hidden) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/util/qemu-thread-posix.c:73
#2 0x00005636c0988130 in qemu_chr_fe_write_all (address@hidden,
address@hidden "\v", address@hidden) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/qemu-char.c:205
#3 0x00005636c08f3483 in vhost_user_write (address@hidden, address@hidden,
address@hidden, dev=0x5636c1bf6b70, dev=0x5636c1bf6b70)
at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:195
#4 0x00005636c08f411c in vhost_user_get_vring_base (dev=0x5636c1bf6b70,
ring=0x7ffe65c087e0) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost-user.c:364
#5 0x00005636c08efff0 in vhost_virtqueue_stop (address@hidden,
address@hidden, vq=0x5636c1bf6d00, idx=1) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:895
#6 0x00005636c08f2944 in vhost_dev_stop (address@hidden, address@hidden) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/vhost.c:1262
#7 0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf6b70,
address@hidden) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:293
#8 0x00005636c08dbe5b in vhost_net_stop (address@hidden, ncs=0x5636c209d110,
address@hidden) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/vhost_net.c:371
#9 0x00005636c08d7745 in virtio_net_vhost_status (status=7 '\a',
n=0x5636c2853338) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:150
#10 virtio_net_set_status (vdev=<optimized out>, status=<optimized out>) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/hw/net/virtio-net.c:162
#11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c2853338,
val=<optimized out>) at /build/qemu-7I4i1R/qemu-2.5+dfsg/hw/virtio/virtio.c:624
#12 0x00005636c098fed2 in vm_state_notify (address@hidden, address@hidden) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1605
#13 0x00005636c089172a in do_vm_stop (state=RUN_STATE_SHUTDOWN) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:724
#14 vm_stop (state=RUN_STATE_SHUTDOWN) at
/build/qemu-7I4i1R/qemu-2.5+dfsg/cpus.c:1407
#15 0x00005636c085d240 in main_loop_should_exit () at
/build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1883
#16 main_loop () at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:1931
#17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
at /build/qemu-7I4i1R/qemu-2.5+dfsg/vl.c:4683
[test case]
unfortunately since this is a race condition, it's very hard to
arbitrarily reproduce; it depends very much on the overall
configuration of the guest as well as how exactly it's shut down -
specifically, its vhost user net must be closed from the host side at
a specific time during qemu shutdown.
I have someone with such a setup who has reported to me their setup is
able to reproduce this reliably, but the config is too complex for me
to reproduce so I have relied on their reproduction and testing to
debug and craft the patch for this.
[regression potential]
the change adds a flag to prevent repeated calls to vhost_net_stop().
This also prevents any calls to vhost_net_cleanup() from
net_vhost_user_event(). Any regression would be seen when stopping
and/or cleaning up a vhost net. Regressions might include failure to
hot-remove a vhost net from a guest, or failure to cleanup (i.e. mem
leak), or crashes during cleanup or stopping a vhost net.
[other info]
this was originally seen in the 2.5 version of qemu - specifically,
the UCA version in trusty-mitaka (which uses the xenial qemu
codebase). However, this appears to still apply upstream, and I am
sending a patch to the qemu list to patch upstream as well.
The specific race condition for this (in the qemu 2.5 code version)
is:
as shown in above bt, thread A starts shutting down qemu, e.g.:
vm_stop->do_vm_stop->vm_state_notify
virtio_set_status
virtio_net_set_status
virtio_net_vhost_status
in this function, code gets to an if-else check for
(!n->vhost_started), which is false (i.e. vhost_started is true) and
enters the else code block, which calls vhost_net_stop() and then sets
n->vhost_started to false.
While thread A is inside vhost_net_stop(), thread B is triggered by
the vhost net chr handler with a user event and calls:
net_vhost_user_event
qmp_set_link (from case CHR_EVENT_CLOSED)
virtio_net_set_link_status (via ->link_status_changed)
virtio_net_set_status
virtio_net_vhost_status
notice thread B has now reached the same function that thread A is in;
since the checks in the function have not changed, thread B follows
the same path that thread A followed, and enters vhost_net_stop().
Since thread A has already shut down and cleaned up some of the
internals, once thread B starts trying to also clean up things, it
segfaults as the shown in the bt.
Avoiding only this duplicate call to vhost_net_stop() is required, but
not enough - let's continue to look at what thread B does after its
call to qmp_set_link() returns:
net_vhost_user_event
vhost_user_stop
vhost_net_cleanup
vhost_dev_cleanup
However, in main() qemu registers atexit(net_cleanup()), which does:
net_cleanup
qemu_del_nic (or qemu_del_net_client, depending on ->type)
qemu_cleanup_net_client
vhost_user_cleanup (via ->cleanup)
vhost_net_cleanup
vhost_dev_cleanup
and the duplicate vhost_dev_cleanup fails assertions since things were
already cleaned up. Additionally, if thread B's call to
vhost_dev_cleanup() comes before thread A finishes vhost_net_stop(),
then that will call vhost_dev_stop() and vhost_disable_notifiers()
which both try to access things that have been freed/cleared/disabled
by vhost_dev_cleanup().
To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1823458/+subscriptions
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Dan Streetman, 2019/04/06
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Dan Streetman, 2019/04/11
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Dan Streetman, 2019/04/11
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu,
Dan Streetman <=
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Dan Streetman, 2019/04/23
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Dan Streetman, 2019/04/23
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Launchpad Bug Tracker, 2019/04/23
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Robie Basak, 2019/04/24
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Corey Bryant, 2019/04/24
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Corey Bryant, 2019/04/24
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Dan Streetman, 2019/04/24
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Dan Streetman, 2019/04/24
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Dan Streetman, 2019/04/30
- [Qemu-devel] [Bug 1823458] Re: race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu, Dan Streetman, 2019/04/30