
Re: [PATCH] backup: don't acquire aio_context in backup_clean


From: Stefan Reiter
Subject: Re: [PATCH] backup: don't acquire aio_context in backup_clean
Date: Thu, 26 Mar 2020 10:43:47 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0

On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote:
> 25.03.2020 18:50, Stefan Reiter wrote:
>> backup_clean is only ever called as a handler via job_exit, which
>
> Hmm.. I'm afraid it's not quite correct.
>
> job_clean
>
>    job_finalize_single
>
>       job_completed_txn_abort (lock aio context)
>
>       job_do_finalize
>
> Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to
> lock the aio context. At the same time, it directly calls
> job_txn_apply(job->txn, job_finalize_single) without locking. Is it a
> bug?


I think, as you say, the idea is that job_do_finalize is always called with the lock acquired. That's why job_completed_txn_abort takes care to release the lock (at least the "outer_ctx", as it calls it) before reacquiring it.
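For reference, my reading of that release/reacquire dance as a simplified sketch (the structure follows job.c as of this version, but the body is abbreviated and hand-edited, so don't take it as the verbatim source):

/* Sketch of the locking in job_completed_txn_abort().  The caller is
 * expected to hold the lock of the job's own AioContext (the
 * "outer_ctx"); it is dropped first so each transaction member's
 * context can be acquired without double-locking, then reacquired to
 * restore the caller's expectations. */
static void job_completed_txn_abort(Job *job)
{
    AioContext *outer_ctx = job->aio_context;
    Job *other_job;

    /* Drop the caller's lock to avoid acquiring it twice below */
    aio_context_release(outer_ctx);

    QLIST_FOREACH(other_job, &job->txn->jobs, txn_list) {
        AioContext *ctx = other_job->aio_context;

        aio_context_acquire(ctx);
        /* ... cancel and finalize other_job ... */
        aio_context_release(ctx);
    }

    aio_context_acquire(outer_ctx);
}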

> And even if job_do_finalize is always called with a locked context,
> where is the guarantee that the contexts of all jobs in the txn are
> locked?


I also don't see anything that guarantees that... I guess it could be adapted to handle locks like job_completed_txn_abort does?

Haven't looked into transactions too much, but does it even make sense to have jobs in different contexts in one transaction?

> Still, let's look through its callers.
>
>    job_finalize
>
>        qmp_block_job_finalize (lock aio context)
>        qmp_job_finalize (lock aio context)
>        test_cancel_concluded (doesn't lock, but it's a test)
>
>    job_completed_txn_success
>
>        job_completed
>
>            job_exit (lock aio context)
>
>            job_cancel
>
>                blockdev_mark_auto_del (lock aio context)
>
>                job_user_cancel
>
>                    qmp_block_job_cancel (locks context)
>                    qmp_job_cancel (locks context)
>
>                job_cancel_err
>
>                    job_cancel_sync (returns job_finish_sync(job, &job_cancel_err, NULL); job_finish_sync just calls the callback)
>
>                        replication_close (it's .bdrv_close... Hmm, I don't see context locking, where is it?)
Hm, I don't see it either. This might indeed be a way to get to job_clean without a lock held.

I don't have any testing set up for replication atm, but if you believe this would be correct I can send a patch for that as well (just acquire the lock in replication_close before job_cancel_async?).
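Roughly like this, I mean (untested sketch against block/replication.c; I'm wrapping the job_cancel_sync call the tree above ends in, and assuming the job handle is s->commit_job as in the current code; the rest of the function is elided):

static void replication_close(BlockDriverState *bs)
{
    BDRVReplicationState *s = bs->opaque;

    if (s->stage == BLOCK_REPLICATION_FAILOVER) {
        /* Take the job's context lock so the job's .clean callback
         * runs with the same guarantees as from all the other
         * callers above */
        AioContext *ctx = s->commit_job->job.aio_context;

        aio_context_acquire(ctx);
        job_cancel_sync(&s->commit_job->job);
        aio_context_release(ctx);
    }

    /* ... rest of the function unchanged ... */
}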


>                        replication_stop (locks context)
>
>                        drive_backup_abort (locks context)
>
>                        blockdev_backup_abort (locks context)
>
>                        job_cancel_sync_all (locks context)
>
>                        cancel_common (locks context)
>
>                test_* (I don't care)


To clarify, aside from the commit message the patch itself does not appear to be wrong? All paths (aside from replication_close mentioned above) guarantee the job lock to be held.
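And to spell out the deadlock for anyone skimming: the sketch below is illustration only (it would simply hang if you called it), showing what the job_exit -> backup_clean path effectively did for a disk in an IO thread:

/* Illustration only: called from the main thread on a node that lives
 * in an IO thread, this is what the old job_exit -> backup_clean path
 * effectively did. */
static void nested_acquire_hangs(BlockDriverState *bs)
{
    AioContext *ctx = bdrv_get_aio_context(bs);

    aio_context_acquire(ctx);   /* job_exit already holds the lock */
    aio_context_acquire(ctx);   /* old backup_clean; the lock is
                                 * recursive, so this succeeds */

    /* bdrv_do_drained_begin ends up in BDRV_POLL_WHILE, which releases
     * ctx exactly once before polling and waits for the IO thread to
     * finish draining.  The first acquire is still in effect, so the
     * IO thread blocks on ctx and the poll never makes progress. */
    bdrv_drained_begin(bs);

    bdrv_drained_end(bs);       /* never reached */
    aio_context_release(ctx);
    aio_context_release(ctx);
}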

>> already acquires the job's context. The job's context is guaranteed to
>> be the same as the one used by backup_top via backup_job_create.
>>
>> Since the previous logic effectively acquired the lock twice, this
>> broke cleanup of backups for disks using IO threads, since the
>> BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would
>> only release the lock once, thus deadlocking with the IO thread.
>>
>> Signed-off-by: Stefan Reiter <address@hidden>

> Just note that this code was recently touched by 0abf2581717a19, so
> adding Sergio (its author) to CC.

>> ---
>>
>> This is a fix for the issue discussed in this part of the thread:
>> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
>> ...not the original problem (core dump) posted by Dietmar.
>>
>> I've still seen it occasionally hang during a backup abort. I'm trying
>> to figure out why that happens; the stack trace indicates a similar
>> problem with the main thread hanging at bdrv_do_drained_begin, though
>> I have no clue why as of yet.
>>
>>  block/backup.c | 4 ----
>>  1 file changed, 4 deletions(-)
>>
>> diff --git a/block/backup.c b/block/backup.c
>> index 7430ca5883..a7a7dcaf4c 100644
>> --- a/block/backup.c
>> +++ b/block/backup.c
>> @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
>>  static void backup_clean(Job *job)
>>  {
>>      BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>> -    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
>> -
>> -    aio_context_acquire(aio_context);
>>      bdrv_backup_top_drop(s->backup_top);
>> -    aio_context_release(aio_context);
>>  }
>>
>>  void backup_do_checkpoint(BlockJob *job, Error **errp)






