On Wed, May 18, 2022 at 02:43:50PM +0200, Paolo Bonzini wrote: > On 5/18/22 14:28, Emanuele Giuseppe Esposito wrote: > > For example, all callers of bdrv_open() always take the AioContext lock. > > Often it is taken very high in the call stack, but it's always taken. > > I think it's actually not a problem of who takes the AioContext lock or > where; the requirements are contradictory: > > * IO_OR_GS_CODE() functions, when called from coroutine context, expect to > be called with the AioContext lock taken (example: bdrv_co_yield_to_drain) > > * to call these functions with the lock taken, the code has to run in the > BDS's home iothread. Attempts to do otherwise results in deadlocks (the > main loop's AIO_WAIT_WHILEs expect progress from the iothread, that cannot > happen without releasing the aiocontext lock) > > > * running the code in the BDS's home iothread is not possible for > GLOBAL_STATE_CODE() functions (unless the BDS home iothread is the main > thread, but that cannot be guaranteed in general) > > > We might suppose that many callbacks are called under drain and in > > GLOBAL_STATE, which should be enough, but from our experimentation in > > the previous series we saw that currently not everything is under drain, > > leaving some operations unprotected (remember assert_graph_writable > > temporarily disabled, since drain coverage for bdrv_replace_child_noperm > > was not 100%?). > > Therefore we need to add more drains. But isn't drain what we decided to > > drop at the beginning? Why isn't drain good? > > To sum up the patch ordering deadlock that we have right now: > > * in some cases, graph manipulations are protected by the AioContext lock > > * eliminating the AioContext lock is needed to move callbacks to coroutine > contexts (see above for the deadlock scenario) > > * moving callbacks to coroutine context is needed by the graph rwlock > implementation > > On one hand, we cannot protect the graph across manipulations with a graph > rwlock without removing the AioContext lock; on the other hand, the > AioContext lock is what _right now_ protects the graph. > > So I'd rather go back to Emanuele's draining approach. It may not be > beautiful, but it allows progress. Once that is in place, we can remove the > AioContext lock (which mostly protects virtio-blk/virtio-scsi code right > now) and reevaluate our next steps. Me too, I don't think the rwlock was particularly nice either. Stefan