From: John Snow
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH v2 04/11] blockjobs: Always use block_job_get_aio_context
Date: Wed, 12 Oct 2016 20:49:41 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0
As context for everyone else on why I'm going down the rabbit hole of
trying to remove external references to AioContext at all, see
https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg00795.html
On 10/07/2016 03:49 AM, Paolo Bonzini wrote:
On 06/10/2016 22:22, John Snow wrote:
Calls .complete(), for which the only implementation is
mirror_complete. Uh, this actually seems messy. Looks like there's
nothing to prevent us from calling this after we've already told it to
complete once.
Yeah, it should have an

    if (s->should_complete) {
        return;
    }

at the beginning. I have other mirror.c patches in my queue so I can
take care of that.
Or something up the stack, in block_job_complete, so it's not up to job
implementations? What if the next implementer "forgets"?
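i.e. something like this in block_job_complete itself (hypothetical
sketch only -- the "completion requested" flag doesn't exist today, and
the real function has other readiness checks I'm glossing over):

    void block_job_complete(BlockJob *job, Error **errp)
    {
        /* ... existing readiness checks ... */

        if (job->completion_requested) {    /* hypothetical new flag */
            error_setg(errp, "Job '%s' already has a completion request",
                       job->id);
            return;
        }
        job->completion_requested = true;

        job->driver->complete(job, errp);
    }

That way no individual driver ever has to remember to guard against a
double-complete on its own.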
block_job_cancel and block_job_complete are different.
block_job_cancel is called in many places, but we can just add a similar
block_job_user_cancel if we wanted a version which takes care to acquire
context and one that does not. (Or we could just acquire the context
regardless, but Paolo warned me ominously that recursive locks are EVIL.
He sounded serious.)
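Something like this is what I had in mind, for the record (sketch only;
the name and the exact split are just placeholders):

    /* User/QMP-facing wrapper; plain block_job_cancel stays as-is for
     * callers that already hold the job's AioContext. */
    void block_job_user_cancel(BlockJob *job)
    {
        AioContext *ctx = block_job_get_aio_context(job);

        aio_context_acquire(ctx);
        block_job_cancel(job);
        aio_context_release(ctx);
    }

That keeps the existing block_job_cancel untouched for all of the
internal callers.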
Not that many places:
- block_job_finish_sync calls it, and you can just release/reacquire
around the call to "finish(job, &local_err)".
This makes me a little nervous because we went through the trouble of
creating this callback, but we're going to assume we know that it's a
public interface that will take the lock for itself (or otherwise does
not require a lock).
In practice it works, but it seems needlessly mystifying in terms of
proving correctness.
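For reference, I read the suggestion as something like the following
inside block_job_finish_sync (sketch only, assuming the caller holds the
job's context on entry):

    AioContext *ctx = block_job_get_aio_context(job);

    aio_context_release(ctx);
    finish(job, &local_err);
    aio_context_acquire(ctx);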
- there are two callers in blockdev.c, and you can just remove the
acquire/release from blockdev.c if you push it in block_job_cancel.
Makes sense; I don't like the association of (bs :: job) here anyway.
Again we're grabbing context for a job where that job may not even be
running.
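If I'm reading this right, the blockdev.c side would then shrink to
roughly the following (sketch; find_block_job here is a hypothetical
lock-free lookup, and I'm glossing over the force/paused handling):

    void qmp_block_job_cancel(const char *device,
                              bool has_force, bool force, Error **errp)
    {
        BlockJob *job = find_block_job(device, errp);  /* lookup only */

        if (!job) {
            return;
        }
        /* block_job_cancel acquires/releases the job's context itself */
        block_job_cancel(job);
    }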
As to block_job_cancel_sync:
Which I didn't audit, because no callers use job->blk to get the
AioContext before calling this; they use bs if bs->job is present.
- replication_stop is not acquiring s->secondary_disk->bs's AioContext.
Seems like a bug on their part. Would be fixed by having cancel acquire
context for itself.
- there is no need to hold the AioContext between ->prepare and ->clean.
My suggestion is to ref the blockjob after storing it in state->job
(you probably should do that anyway) and to unref it in ->clean. Then
you can again let block_job_cancel_sync(bs->job) take the AioContext,
which it will do in block_job_finish_sync.
Yeah, I should be reffing it anyway.
The rest of this... What I think you mean is acquiring and releasing the
context as needed for EACH of prepare, commit, abort, and clean, right?
And then in this case, it simply wouldn't be necessary for abort, as the
sync cancel would do it for us.
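Concretely, I take that to be something along these lines in the
blockdev.c transaction actions (sketch; field and function names are
approximate):

    /* in ->prepare, once the job exists: */
    state->job = bs->job;
    block_job_ref(state->job);

    /* in ->abort, no explicit context juggling here: */
    block_job_cancel_sync(state->job);

    /* in ->clean: */
    if (state->job) {
        block_job_unref(state->job);
    }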
block_job_complete has no direct callers outside of QMP, but it is also
used as a callback by block_job_complete_sync, used in qemu-img for
run_block_job. I can probably rewrite qemu-img to avoid this usage.
No need to: qemu-img is not acquiring the AioContext, so it's okay to
let block_job_complete do that (like block_job_cancel,
block_job_complete will be called by block_job_finish_sync without the
AioContext acquired).
Eh? Oh, you're right, it just gets it for the sake of aio_poll.
Paolo
Alright.
Say I *do* push the acquisitions down into blockjob.c. What benefit does
that provide? Won't I still need the block_job_get_aio_context()
function (at least internally) to know which context to acquire? This
would preclude you from deleting it.
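(For reference, the rough idea behind that helper -- glossing over the
exact details in the patch -- is along these lines:)

    /* rough sketch, not the literal patch */
    AioContext *block_job_get_aio_context(BlockJob *job)
    {
        if (job->deferred_to_main_loop) {
            return qemu_get_aio_context();
        }
        return blk_get_aio_context(job->blk);
    }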
Plus... we'd be removing some fairly simple locking and then inflating
the complexity tenfold. I'm not convinced this is an improvement.
As context and a refresher (for me when I re-read this email in 12
hours), there are two places externally that use an AioContext lock
acquired from *within* a BlockJob, excluding those that acquire a
context separately from a Job and use that to reason that accesses to
the job are safe (for example, blockdev_mark_auto_del).
(1) QMP interface for job management
(2) bdrv_drain_all, in block/io.c
(1) AFAICT, the QMP interface is concerned with assuring itself it has
unique access to the BlockJob structure; it doesn't really care about
the AioContext itself -- just race-free access to the job.
This is not necessarily buggy today because, even though we grab the
BlockBackend's context unconditionally, we already know the main/monitor
thread is not accessing the blockjob. It's still silly, though.
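The pattern I'm describing looks roughly like this today (sketch of the
current blockdev.c shape, from memory, so details may be slightly off):

    void qmp_block_job_complete(const char *device, Error **errp)
    {
        AioContext *aio_context;
        BlockJob *job = find_block_job(device, &aio_context, errp);

        if (!job) {
            return;
        }
        block_job_complete(job, errp);
        aio_context_release(aio_context);
    }

where find_block_job() acquires the BlockBackend's context as a side
effect -- i.e. the lock is really only there to guard the job itself.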
(2) bdrv_drain_all appears to be worried about the same thing; we just
need to safely deliver pause/resume messages.
I'm less sure about where this can run from, and I suspect that if the
job has deferred to the main loop, this could be buggy. If
bdrv_drain_all is called from context A and the job is running on
context M having deferred from B, we may lock against context B (from A)
and still have racy access between A and M. (Maybe?)
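For reference, the part of bdrv_drain_all I'm thinking of is roughly the
following (from memory, so the details may be off):

    BdrvNextIterator it;
    BlockDriverState *bs;

    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        AioContext *aio_context = bdrv_get_aio_context(bs);

        aio_context_acquire(aio_context);
        if (bs->job) {
            block_job_pause(bs->job);
        }
        aio_context_release(aio_context);
    }

i.e. it locks the context the BDS is attached to, which is not
necessarily where the job is actually running once it has deferred to
the main loop.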
It looks like none of these usages actually cares what context it's
getting, just that it gets safe access to the job itself. If you want to
decouple jobs from their BlockBackend's AioContext, perhaps we just need
to replace these with a simple mutex...?
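i.e. something as simple as this (purely hypothetical, names invented):

    struct BlockJob {
        /* ... existing fields ... */
        QemuMutex lock;  /* guards the fields QMP and drain actually touch */
    };

    void block_job_lock(BlockJob *job)
    {
        qemu_mutex_lock(&job->lock);
    }

    void block_job_unlock(BlockJob *job)
    {
        qemu_mutex_unlock(&job->lock);
    }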
As it stands, though, pushing AioContext acquisitions down into
blockjob.c isn't actually going to fix anything, is it? Why not just
leave it be for right now?
Would you be terribly offended if I left this patch as-is for now and
we worked on removing the AioContext locks afterwards, or are you
adamant that I cooperate in getting block_job_get_aio_context removed
before I add more usages?
Sorry I'm being so obtuse about this all.
--js