On 05.09.23 at 12:01, Fiona Ebner wrote:
> Can we assume block_job_remove_all_bdrv() to always hold the job's
> AioContext? And if yes, can we just tell bdrv_graph_wrlock() that it
> needs to release that before polling to fix the deadlock?
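For context, my understanding is that bdrv_graph_wrlock() already has a
mechanism along those lines when it is passed a non-NULL bs: it temporarily
drops that node's AioContext around the polling and re-acquires it afterwards.
Very roughly (a trimmed-down sketch of the shape in block/graph-lock.c, not
the actual code; reader_count() stands in for the internal reader bookkeeping
and the drain/has_writer handling is omitted):

void bdrv_graph_wrlock(BlockDriverState *bs)
{
    AioContext *ctx = NULL;

    GLOBAL_STATE_CODE();

    /* Only a non-mainloop AioContext gets released around the polling. */
    if (bs) {
        ctx = bdrv_get_aio_context(bs);
        if (ctx == qemu_get_aio_context()) {
            ctx = NULL;
        } else {
            aio_context_release(ctx);
        }
    }

    /* Poll until all readers have left their graph-lock critical sections. */
    AIO_WAIT_WHILE_UNLOCKED(NULL, reader_count() >= 1);

    if (ctx) {
        aio_context_acquire(ctx);
    }
}

So the idea below is to hand it the individual child's bs instead of NULL.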
I tried doing something similar as a proof of concept:
diff --git a/blockjob.c b/blockjob.c
index 58c5d64539..1a696241a0 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -198,19 +198,19 @@ void block_job_remove_all_bdrv(BlockJob *job)
      * one to make sure that such a concurrent access does not attempt
      * to process an already freed BdrvChild.
      */
-    bdrv_graph_wrlock(NULL);
     while (job->nodes) {
         GSList *l = job->nodes;
         BdrvChild *c = l->data;
         job->nodes = l->next;
+        bdrv_graph_wrlock(c->bs);
         bdrv_op_unblock_all(c->bs, job->blocker);
         bdrv_root_unref_child(c);
+        bdrv_graph_wrunlock();
         g_slist_free_1(l);
     }
-    bdrv_graph_wrunlock();
 }
and while it did get slightly further, I ran into another deadlock with:
#0 0x00007f1941155136 in __ppoll (fds=0x55992068fb20, nfds=2, timeout=<optimized
out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
#1 0x000055991c6a1a3f in qemu_poll_ns (fds=0x55992068fb20, nfds=2, timeout=-1)
at ../util/qemu-timer.c:339
#2 0x000055991c67ed6c in fdmon_poll_wait (ctx=0x55991f058810,
ready_list=0x7ffda8c987b0, timeout=-1) at ../util/fdmon-poll.c:79
#3 0x000055991c67e6a8 in aio_poll (ctx=0x55991f058810, blocking=true) at
../util/aio-posix.c:670
#4 0x000055991c50a763 in bdrv_graph_wrlock (bs=0x0) at
../block/graph-lock.c:145
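Note that in frame #4, bs=0x0, so this caller gives bdrv_graph_wrlock() no
AioContext it could drop before polling. My guess is that it is the same
pattern as the original deadlock, i.e. something like the following
(hypothetical illustration only; the real caller above frame #4 is not
visible in the cut-off backtrace and some_graph_change() is a made-up name):

static void some_graph_change(BlockDriverState *bs)
{
    AioContext *ctx = bdrv_get_aio_context(bs);

    aio_context_acquire(ctx);    /* an iothread's AioContext is held...    */
    bdrv_graph_wrlock(NULL);     /* ...but with bs=NULL nothing is dropped */
    /*
     * The aio_poll() in frames #0-#3 then waits for the last graph-lock
     * reader to finish; a reader coroutine running in that iothread cannot
     * make progress (and drop its reader lock) while we still hold its
     * AioContext, so the poll never returns.
     */
    bdrv_graph_wrunlock();
    aio_context_release(ctx);
}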