
From: Eric Blake
Subject: Re: [Qemu-block] [Qemu-devel] [PULL 21/35] block: fix QEMU crash with scsi-hd and drive_del
Date: Mon, 6 Aug 2018 17:04:28 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

On 06/18/2018 11:44 AM, Kevin Wolf wrote:
From: Greg Kurz <address@hidden>

Removing a drive with drive_del while it is being used to run an I/O
intensive workload can cause QEMU to crash.

The problem is that we should avoid making block driver graph
changes while we have in-flight requests. Let's drain all I/O
for this BB before calling bdrv_root_unref_child().

Signed-off-by: Greg Kurz <address@hidden>
Signed-off-by: Kevin Wolf <address@hidden>
  block/block-backend.c | 5 +++++
  1 file changed, 5 insertions(+)

Interestingly enough, git bisect is reliably pointing to this commit as the culprit in the changed output of iotests './check -nbd 83':

083 8s ... - output mismatch (see 083.out.bad)
--- /home/eblake/qemu/tests/qemu-iotests/083.out       2018-02-26 11:40:31.605792220 -0600
+++ /home/eblake/qemu/tests/qemu-iotests/083.out.bad   2018-08-06 16:57:03.411861660 -0500
@@ -41,7 +41,6 @@

 === Check disconnect after neg2 ===

-Connection closed
 read failed: Input/output error

 === Check disconnect 8 neg2 ===
@@ -54,7 +53,6 @@

 === Check disconnect before request ===

-Connection closed
 read failed: Input/output error

 === Check disconnect after request ===
@@ -116,7 +114,6 @@

 === Check disconnect after neg-classic ===

-Connection closed
 read failed: Input/output error

 === Check disconnect before neg1 ===
Failures: 083
Failed 1 of 1 tests

diff --git a/block/block-backend.c b/block/block-backend.c
index 2d1a3463e8..6b75bca317 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -767,6 +767,11 @@ void blk_remove_bs(BlockBackend *blk)
 
     blk_update_root_state(blk);
 
+    /* bdrv_root_unref_child() will cause blk->root to become stale and may
+     * switch to a completion coroutine later on. Let's drain all I/O here
+     * to avoid that and a potential QEMU crash.
+     */
+    blk_drain(blk);
     blk->root = NULL;

Test 083 sets up a client that intentionally disconnects at critical points in the NBD protocol exchange, to ensure that the server reacts sanely. I suspect that somewhere in the NBD code, the server detects the disconnect and somehow calls into blk_remove_bs() (although I could not quickly find the backtrace). Prior to this patch, the 'Connection closed' message resulted from other NBD coroutines getting a shot at the (now-closed) connection; after this patch, the additional blk_drain() somehow tweaks things in a way that prevents those coroutines from printing the message. If so, then the change in the 083 reference output is probably intentional, and we should update it.

But I'm having a hard time convincing myself that this is the case, particularly since I'm not even sure how to easily debug the assumptions I made above.

I'm very weak on the whole notion of what blk_drain() vs. blk_remove_bs() is really supposed to be doing, so I could easily be persuaded that the change in output is a regression instead of a fix.


Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
