[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead o
From: |
Max Reitz |
Subject: |
Re: [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers |
Date: |
Mon, 28 Jan 2019 16:59:40 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0 |
On 28.01.19 12:29, Vladimir Sementsov-Ogievskiy wrote:
> 18.01.2019 17:56, Max Reitz wrote:
>> On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
>>> Drop write notifiers and use filter node instead. Changes:
>>>
>>> 1. copy-before-writes now handled by filter node, so, drop all
>>> is_write_notifier arguments.
>>>
>>> 2. we don't have intersecting requests, so their handling is dropped.
>>> Instead, synchronization works as follows:
>>> when backup or backup-top starts copying of some area it firstly
>>> clears copy-bitmap bits, and nobody touches areas, not marked with
>>> dirty bits in copy-bitmap, so there is no intersection. Also, backup
>>> job copy operations are surrounded by bdrv region lock, which is
>>> actually serializing request, to not interfer with guest writes and
>>> not read changed data from source (before reading we clear
>>> corresponding bit in copy-bitmap, so, this area is not more handled by
>>> backup-top).
>>>
>>> 3. To sync with in-flight requests we now just drain hook node, we
>>> don't need rw-lock.
>>>
>>> 4. After the whole backup loop (top, full, incremental modes), we need
>>> to check for not copied clusters, which were under backup-top operation
>>> and we skipped them, but backup-top operation failed, error returned to
>>> the guest and dirty bits set back.
>>>
>>> 5. Don't create additional blk, use backup-top children for copy
>>> operations.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>>> ---
>>> block/backup.c | 285 ++++++++++++++++++++++++++-----------------------
>>> 1 file changed, 149 insertions(+), 136 deletions(-)
>>>
>>> diff --git a/block/backup.c b/block/backup.c
>>> index 88c0242b4e..e332909fb7 100644
>>> --- a/block/backup.c
>>> +++ b/block/backup.c
>>
>> [...]
>>
>>> @@ -300,21 +231,23 @@ static void backup_abort(Job *job)
>>> static void backup_clean(Job *job)
>>> {
>>> BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
>>> - assert(s->target);
>>> - blk_unref(s->target);
>>> +
>>> + /* We must clean it to not crash in backup_drain. */
>>> s->target = NULL;
>>
>> Why not set s->source to NULL along with it? It makes sense if you're
>> going to drop the backup-top node because both of these are its children.
>
> agree.
>
>>
>>>
>>> if (s->copy_bitmap) {
>>> hbitmap_free(s->copy_bitmap);
>>> s->copy_bitmap = NULL;
>>> }
>>> +
>>> + bdrv_backup_top_drop(s->backup_top);
>>> }
>>
>> [...]
>>
>>> @@ -386,21 +319,45 @@ static int coroutine_fn
>>> backup_run_incremental(BackupBlockJob *job)
>>> bool error_is_read;
>>> int64_t offset;
>>> HBitmapIter hbi;
>>> + void *lock = NULL;
>>>
>>> hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>> - while ((offset = hbitmap_iter_next(&hbi)) != -1) {
>>> + while (hbitmap_count(job->copy_bitmap)) {
>>> + offset = hbitmap_iter_next(&hbi);
>>> + if (offset == -1) {
>>> + /*
>>> + * we may have skipped some clusters, which were handled by
>>> + * backup-top, but failed and finished by returning error to
>>> + * the guest and set dirty bit back.
>>> + */
>>> + hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
>>> + offset = hbitmap_iter_next(&hbi);
>>> + assert(offset);
>>
>> I think you want to assert "offset >= 0".
>>
>>> + }
>>> +
>>> + lock = bdrv_co_try_lock(job->source, offset, job->cluster_size);
>>> + /*
>>> + * Dirty bit is set, which means that there are no in-flight
>>> + * write requests on this area. We must succeed.
>>> + */
>>> + assert(lock);
>>
>> I'm not sure that is true right now, but more on that below in backup_run().
>>
>>> +
>>> do {
>>> if (yield_and_check(job)) {
>>> + bdrv_co_unlock(lock);
>>> return 0;
>>> }
>>> - ret = backup_do_cow(job, offset,
>>> - job->cluster_size, &error_is_read, false);
>>> + ret = backup_do_cow(job, offset, job->cluster_size,
>>> &error_is_read);
>>> if (ret < 0 && backup_error_action(job, error_is_read, -ret)
>>> ==
>>> BLOCK_ERROR_ACTION_REPORT)
>>> {
>>> + bdrv_co_unlock(lock);
>>> return ret;
>>> }
>>> } while (ret < 0);
>>> +
>>> + bdrv_co_unlock(lock);
>>> + lock = NULL;
>>
>> This statement seems unnecessary here.
>>
>>> }
>>>
>>> return 0;
>>
>> [...]
>>
>>> @@ -447,26 +402,39 @@ static int coroutine_fn backup_run(Job *job, Error
>>> **errp)
>>> hbitmap_set(s->copy_bitmap, 0, s->len);
>>> }
>>>
>>> - s->before_write.notify = backup_before_write_notify;
>>> - bdrv_add_before_write_notifier(bs, &s->before_write);
>>> -
>>> if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
>>> /* All bits are set in copy_bitmap to allow any cluster to be
>>> copied.
>>> * This does not actually require them to be copied. */
>>> while (!job_is_cancelled(job)) {
>>> - /* Yield until the job is cancelled. We just let our
>>> before_write
>>> - * notify callback service CoW requests. */
>>> + /*
>>> + * Yield until the job is cancelled. We just let our
>>> backup-top
>>> + * fileter driver service CbW requests.
>>
>> *filter
>>
>>> + */
>>> job_yield(job);
>>> }
>>> } else if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>>> ret = backup_run_incremental(s);
>>> } else {
>>
>> [...]
>>
>>> @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error
>>> **errp)
>>> if (alloced < 0) {
>>> ret = alloced;
>>> } else {
>>> + if (!hbitmap_get(s->copy_bitmap, offset)) {
>>> + trace_backup_do_cow_skip(job, offset);
>>> + continue; /* already copied */
>>> + }
>>> + if (!lock) {
>>> + lock = bdrv_co_try_lock(s->source, offset,
>>> s->cluster_size);
>>> + /*
>>> + * Dirty bit is set, which means that there are no
>>> in-flight
>>> + * write requests on this area. We must succeed.
>>> + */
>>> + assert(lock);
>>
>> What if I have a different parent node for the source that issues
>> concurrent writes? This used to work fine because the before_write
>> notifier would still work. After this patch, that would be broken
>> because those writes would not cause a CbW.
>
> But haw could you have this different parent node? After appending filter,
> there should not be such nodes.
Unless you append them afterwards:
> And I think, during backup it should be
> forbidden to append new parents to source, ignoring filter, as it definitely
> breaks what filter does.
Agreed, but then this needs to be implemented.
> And it applies to other block-job with their filters.
> If we appended a filter, we don't want someone other to write omit our filter.
>
>>
>> That's not so bad because we just have to make sure that all writes go
>> through the backup-top node. That would make this assertion valid
>> again, too. But that means we cannot share PERM_WRITE; see [1].
>
> But we don't share PERM_WRITE on source in backup_top, only on target.
Are you sure? The job itself shares it, and the filter shares it, too,
as far as I can see. It uses bdrv_filter_default_perms(), and that does
seem to share PERM_WRITE.
>>
>>> + }
>>> ret = backup_do_cow(s, offset, s->cluster_size,
>>> - &error_is_read, false);
>>> + &error_is_read);
>>> }
>>> if (ret < 0) {
>>> /* Depending on error action, fail now or retry cluster */
>>> @@ -516,17 +497,34 @@ static int coroutine_fn backup_run(Job *job, Error
>>> **errp)
>>> break;
>>> } else {
>>> offset -= s->cluster_size;
>>> + retry = true;
>>> continue;
>>> }
>>> }
>>> }
>>> + if (lock) {
>>> + bdrv_co_unlock(lock);
>>> + lock = NULL;
>>> + }
>>> + if (ret == 0 && !job_is_cancelled(job) &&
>>> + hbitmap_count(s->copy_bitmap))
>>> + {
>>> + /*
>>> + * we may have skipped some clusters, which were handled by
>>> + * backup-top, but failed and finished by returning error to
>>> + * the guest and set dirty bit back.
>>
>> So it's a matter of a race?
>>
>>> + */
>>> + goto iteration;
>>> + }
>>
>> Why not wrap everything in a do {} while (ret == 0 && !job_is...)
>> instead? Because it would mean reindenting everything?
>
> Don't remember, but assume that yes. And this code is anyway "To be
> refactored",
> I want all FULL/TOP/INCREMENTAL go through the same (mostly) code path.
Hm, well, if you want to refactor it later anyway... But I don't like
gotos that go backwards very much, unless there is a good reason to have
them (and there isn't here).
>>> }
>>>
>>> - notifier_with_return_remove(&s->before_write);
>>> + /* wait pending CBW operations in backup-top */
>>> + bdrv_drain(s->backup_top);
>>>
>>> - /* wait until pending backup_do_cow() calls have completed */
>>> - qemu_co_rwlock_wrlock(&s->flush_rwlock);
>>> - qemu_co_rwlock_unlock(&s->flush_rwlock);
>>> + backup_top_progress = bdrv_backup_top_progress(s->backup_top);
>>> + job_progress_update(job, ret + backup_top_progress -
>>
>> Why the "ret"?
>
> oops, it looks like a copy-paste bug ("ret" is reasonable in backup_do_cow())
>
>>
>>> + s->backup_top_progress);
>>> + s->backup_top_progress = backup_top_progress;
>>
>> So the backup-top progress is ignored during basically all of the block
>> job until it suddenly jumps to 100 % completion? That doesn't sound ideal.
>>
>> Or did you mean to put this into the for () loop of MODE_TOP/MODE_FULL?
>> (And the while() loop of MODE_NONE)
>
>
> It is done in backup_do_cow(), so FULL and TOP are covered.
>
> But you are right that MODE_NONE seems to have a problem about it.. And just
> updating it
> in a while loop would not work, as I doubt that job_yield will return until
> job finish
> or user interaction like pause/continue/cancel..
>
> So now, it looks better to call job_progress_update() from backup_top
> directly, and drop
> this hack.
Hmmm... I don't think job_*() calls belong in backup_top. How about
adding a callback to bdrv_backup_top_append()?
>>>
>>> return ret;
>>> }
>>> @@ -563,6 +561,9 @@ BlockJob *backup_job_create(const char *job_id,
>>> BlockDriverState *bs,
>>> int ret;
>>> int64_t cluster_size;
>>> HBitmap *copy_bitmap = NULL;
>>> + BlockDriverState *backup_top = NULL;
>>> + uint64_t all_except_resize = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE
>>> |
>>> + BLK_PERM_WRITE_UNCHANGED |
>>> BLK_PERM_GRAPH_MOD;
>>
>> BLK_PERM_ALL & ~BLK_PERM_RESIZE?
>>
>> [1] But we probably do not want to share either PERM_WRITE or
>> PERM_WRITE_UNCHANGED because during the duration of the backup,
>> everything should go through the backup-top filter (not sure about
>> PERM_WRITE_UNCHANGED right now). Or is that something that the filter
>> node should enforce in backup_top_child_perm()?
>
> It's not shared perm of backup_top, it's a shared perm of block-job
> common.blk, which is
> used only to "job->len is fixed, so we can't allow resize", so this part is
> not changed.
>
> So yes, the problem you mean by [1] is about backup_top_child_perm() where we
> share PERM_WRITE.
> And it is described by comment, we must share this write perm, otherwise we
> break guest writes.
For the target, yes, but the problem is sharing it on the source.
> We share PERM_WRITE in backup_top to force its target child share PERM_WRITE
> on its backing,
> as backing of target is source.
>
> But again, we share PERM_WRITE only on target, and it is shared in current
> code too.
I'm not so sure whether PERM_WRITE is shared only on the target.
>>
>>>
>>> assert(bs);
>>> assert(target);
>>> @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id,
>>> BlockDriverState *bs,
>>>
>>> copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
>>>
>>> - /* job->len is fixed, so we can't allow resize */
>>> - job = block_job_create(job_id, &backup_job_driver, txn, bs,
>>> - BLK_PERM_CONSISTENT_READ,
>>> - BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> - BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD,
>>> - speed, creation_flags, cb, opaque, errp);
>>> - if (!job) {
>>> + /*
>>> + * bdrv_get_device_name will not help to find device name starting from
>>> + * @bs after backup-top append,
>>
>> Why not? Since backup-top is appended, shouldn't all parents of @bs be
>> parents of @backup_top then? (Making bdrv_get_parent_name() return the
>> same result)
>
> bdrv_get_device_name goes finally through role->get_name, and only root role
> has
> this handler. After append we'll have backing role instead of root.
Ah, I see, I asked the wrong question.
Why is block_job_create() called on bs and not on backup_top? mirror
calls it on mirror_top_bs.
>>> so let's calculate job_id before. Do
>>> + * it in the same way like block_job_create
>>> + */
>>> + if (job_id == NULL && !(creation_flags & JOB_INTERNAL)) {
>>> + job_id = bdrv_get_device_name(bs);
>>> + }
>>> +
>>> + backup_top = bdrv_backup_top_append(bs, target, copy_bitmap, errp);
>>> + if (!backup_top) {
>>> goto error;
>>> }
>>>
>>> - /* The target must match the source in size, so no resize here either
>>> */
>>> - job->target = blk_new(BLK_PERM_WRITE,
>>> - BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
>>> - BLK_PERM_WRITE_UNCHANGED | BLK_PERM_GRAPH_MOD);
>>> - ret = blk_insert_bs(job->target, target, errp);
>>> - if (ret < 0) {
>>> + /* job->len is fixed, so we can't allow resize */
>>> + job = block_job_create(job_id, &backup_job_driver, txn, bs, 0,
>>
>> Is there a reason you dropped PERM_CONSISTENT_READ here?
>
> Because, we don't use this blk for read now, we read through backup_top child.
Makes sense.
>>> + all_except_resize, speed, creation_flags,
>>> + cb, opaque, errp);
>>> + if (!job) {
>>> goto error;
>>> }
>>>
>>> + job->source = backup_top->backing;
>>> + job->target = ((BDRVBackupTopState *)backup_top->opaque)->target;
>>
>> This looks really ugly. I think as long as the block job performs
>> writes itself, it should use its own BlockBackend.
>
> They are not BlockBackends, they are BdrvChildren.
Exactly, which is what I don't like. They are children of a node that
is implemented in a different file, it looks weird to use them here.
> It was Kevin's idea to reuse filter's
> children in backup job:
>
> https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01017.html
It's still ugly if backup_top is in a different file. Well, maybe just
to me.
> Hmm, and this is also why I need PERM_WRITE in backup_top, to write to target.
>
>>
>> Alternatively, I think it would make sense for the backup-top filter to
>> offer functions that this job can use to issue writes to the target.
>> Then the backup job would no longer need direct access to the target as
>> a BdrvChild.
So what would be the problem with this?
Max
signature.asc
Description: OpenPGP digital signature