Re: [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers

From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-devel] [PATCH v5 11/11] block/backup: use backup-top instead of write notifiers
Date: Mon, 28 Jan 2019 19:00:55 +0000
On 28 Jan 2019 at 20:40, Kevin Wolf <address@hidden> wrote:
On 28.01.2019 at 17:44, Vladimir Sementsov-Ogievskiy wrote:
> 28.01.2019 18:59, Max Reitz wrote:
> > On 28.01.19 12:29, Vladimir Sementsov-Ogievskiy wrote:
> >> 18.01.2019 17:56, Max Reitz wrote:
> >>> On 29.12.18 13:20, Vladimir Sementsov-Ogievskiy wrote:
> >>>> Drop write notifiers and use filter node instead. Changes:
> >>>>
> >>>> 1. copy-before-write is now handled by the filter node, so drop all
> >>>> is_write_notifier arguments.
> >>>>
> >>>> 2. we don't have intersecting requests, so their handling is dropped.
> >>>> Instead, synchronization works as follows:
> >>>> when backup or backup-top starts copying some area, it first
> >>>> clears the copy-bitmap bits, and nobody touches areas not marked with
> >>>> dirty bits in copy-bitmap, so there is no intersection. Also, backup
> >>>> job copy operations are surrounded by a bdrv region lock, which
> >>>> actually serializes requests, so as not to interfere with guest writes
> >>>> and not to read changed data from the source (before reading we clear
> >>>> the corresponding bit in copy-bitmap, so this area is no longer handled
> >>>> by backup-top).
> >>>>
> >>>> 3. To sync with in-flight requests, we now just drain the hook node; we
> >>>> don't need the rw-lock.
> >>>>
> >>>> 4. After the whole backup loop (top, full, incremental modes), we need
> >>>> to check for clusters that were not copied: clusters that were under a
> >>>> backup-top operation and were skipped, but the backup-top operation
> >>>> failed, the error was returned to the guest, and the dirty bits were
> >>>> set back.
> >>>>
> >>>> 5. Don't create an additional blk; use backup-top children for copy
> >>>> operations.
> >>>>
> >>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
> >>>> ---
> >>>> block/backup.c | 285 ++++++++++++++++++++++++++-----------------------
> >>>> 1 file changed, 149 insertions(+), 136 deletions(-)
> >>>>
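[Editor's note: a minimal sketch of the bitmap-based serialization described in point 2 of the commit message. The names and the plain 64-bit bit array are illustrative stand-ins, not QEMU's real HBitmap API: a worker "claims" a cluster by clearing its dirty bit, so whoever clears the bit owns the copy, and guest CbW and the job never copy the same area twice.]

```c
#include <stdbool.h>
#include <stdint.h>

#define NB_CLUSTERS 64

typedef struct {
    uint64_t bits;          /* one dirty bit per cluster */
} CopyBitmap;

/* Claim cluster @i: returns true iff we cleared a set bit and
 * therefore own the copy operation for that cluster. */
static bool copy_bitmap_claim(CopyBitmap *bm, unsigned i)
{
    uint64_t mask = 1ull << i;
    if (!(bm->bits & mask)) {
        return false;       /* already copied, or copy in flight */
    }
    bm->bits &= ~mask;
    return true;
}

/* On failure the owner sets the bit back, so a later pass
 * (point 4 of the commit message) retries the cluster. */
static void copy_bitmap_set_back(CopyBitmap *bm, unsigned i)
{
    bm->bits |= 1ull << i;
}
```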
> >>>> diff --git a/block/backup.c b/block/backup.c
> >>>> index 88c0242b4e..e332909fb7 100644
> >>>> --- a/block/backup.c
> >>>> +++ b/block/backup.c
> >>>
> >>> [...]
> >>>
> >>>> @@ -300,21 +231,23 @@ static void backup_abort(Job *job)
> >>>> static void backup_clean(Job *job)
> >>>> {
> >>>> BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
> >>>> - assert(s->target);
> >>>> - blk_unref(s->target);
> >>>> +
> >>>> + /* We must clean it to not crash in backup_drain. */
> >>>> s->target = NULL;
> >>>
> >>> Why not set s->source to NULL along with it? It makes sense if you're
> >>> going to drop the backup-top node because both of these are its children.
> >>
> >> agree.
> >>
> >>>
> >>>>
> >>>> if (s->copy_bitmap) {
> >>>> hbitmap_free(s->copy_bitmap);
> >>>> s->copy_bitmap = NULL;
> >>>> }
> >>>> +
> >>>> + bdrv_backup_top_drop(s->backup_top);
> >>>> }
> >>>
> >>> [...]
> >>>
> >>>> @@ -386,21 +319,45 @@ static int coroutine_fn
> >>>> backup_run_incremental(BackupBlockJob *job)
> >>>> bool error_is_read;
> >>>> int64_t offset;
> >>>> HBitmapIter hbi;
> >>>> + void *lock = NULL;
> >>>>
> >>>> hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
> >>>> - while ((offset = hbitmap_iter_next(&hbi)) != -1) {
> >>>> + while (hbitmap_count(job->copy_bitmap)) {
> >>>> + offset = hbitmap_iter_next(&hbi);
> >>>> + if (offset == -1) {
> >>>> + /*
> >>>> + * we may have skipped some clusters, which were handled by
> >>>> + * backup-top, but failed and finished by returning error to
> >>>> + * the guest and set dirty bit back.
> >>>> + */
> >>>> + hbitmap_iter_init(&hbi, job->copy_bitmap, 0);
> >>>> + offset = hbitmap_iter_next(&hbi);
> >>>> + assert(offset);
> >>>
> >>> I think you want to assert "offset >= 0".
> >>>
> >>>> + }
> >>>> +
> >>>> + lock = bdrv_co_try_lock(job->source, offset, job->cluster_size);
> >>>> + /*
> >>>> + * Dirty bit is set, which means that there are no in-flight
> >>>> + * write requests on this area. We must succeed.
> >>>> + */
> >>>> + assert(lock);
> >>>
> >>> I'm not sure that is true right now, but more on that below in
> >>> backup_run().
> >>>
> >>>> +
> >>>> do {
> >>>> if (yield_and_check(job)) {
> >>>> + bdrv_co_unlock(lock);
> >>>> return 0;
> >>>> }
> >>>> - ret = backup_do_cow(job, offset,
> >>>> - job->cluster_size, &error_is_read,
> >>>> false);
> >>>> + ret = backup_do_cow(job, offset, job->cluster_size,
> >>>> &error_is_read);
> >>>> if (ret < 0 && backup_error_action(job, error_is_read,
> >>>> -ret) ==
> >>>> BLOCK_ERROR_ACTION_REPORT)
> >>>> {
> >>>> + bdrv_co_unlock(lock);
> >>>> return ret;
> >>>> }
> >>>> } while (ret < 0);
> >>>> +
> >>>> + bdrv_co_unlock(lock);
> >>>> + lock = NULL;
> >>>
> >>> This statement seems unnecessary here.
> >>>
> >>>> }
> >>>>
> >>>> return 0;
> >>>
> >>> [...]
> >>>
> >>>> @@ -447,26 +402,39 @@ static int coroutine_fn backup_run(Job *job, Error
> >>>> **errp)
> >>>> hbitmap_set(s->copy_bitmap, 0, s->len);
> >>>> }
> >>>>
> >>>> - s->before_write.notify = backup_before_write_notify;
> >>>> - bdrv_add_before_write_notifier(bs, &s->before_write);
> >>>> -
> >>>> if (s->sync_mode == MIRROR_SYNC_MODE_NONE) {
> >>>> /* All bits are set in copy_bitmap to allow any cluster to be
> >>>> copied.
> >>>> * This does not actually require them to be copied. */
> >>>> while (!job_is_cancelled(job)) {
> >>>> - /* Yield until the job is cancelled. We just let our
> >>>> before_write
> >>>> - * notify callback service CoW requests. */
> >>>> + /*
> >>>> + * Yield until the job is cancelled. We just let our
> >>>> backup-top
> >>>> + * fileter driver service CbW requests.
> >>>
> >>> *filter
> >>>
> >>>> + */
> >>>> job_yield(job);
> >>>> }
> >>>> } else if (s->sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) {
> >>>> ret = backup_run_incremental(s);
> >>>> } else {
> >>>
> >>> [...]
> >>>
> >>>> @@ -505,8 +474,20 @@ static int coroutine_fn backup_run(Job *job, Error
> >>>> **errp)
> >>>> if (alloced < 0) {
> >>>> ret = alloced;
> >>>> } else {
> >>>> + if (!hbitmap_get(s->copy_bitmap, offset)) {
> >>>> + trace_backup_do_cow_skip(job, offset);
> >>>> + continue; /* already copied */
> >>>> + }
> >>>> + if (!lock) {
> >>>> + lock = bdrv_co_try_lock(s->source, offset,
> >>>> s->cluster_size);
> >>>> + /*
> >>>> + * Dirty bit is set, which means that there are no
> >>>> in-flight
> >>>> + * write requests on this area. We must succeed.
> >>>> + */
> >>>> + assert(lock);
> >>>
> >>> What if I have a different parent node for the source that issues
> >>> concurrent writes? This used to work fine because the before_write
> >>> notifier would still work. After this patch, that would be broken
> >>> because those writes would not cause a CbW.
> >>
> >> But how could you have this different parent node? After appending the
> >> filter, there should not be such nodes.
> >
> > Unless you append them afterwards:
> >
> >> And I think during backup it should be
> >> forbidden to append new parents to the source, bypassing the filter, as
> >> that definitely
> >> breaks what the filter does.
> >
> > Agreed, but then this needs to be implemented.
> >
> >> And it applies to other block jobs with their filters.
> >> If we appended a filter, we don't want someone else's writes to bypass
> >> our filter.
> >>
> >>>
> >>> That's not so bad because we just have to make sure that all writes go
> >>> through the backup-top node. That would make this assertion valid
> >>> again, too. But that means we cannot share PERM_WRITE; see [1].
> >>
> >> But we don't share PERM_WRITE on source in backup_top, only on target.
> >
> > Are you sure? The job itself shares it, and the filter shares it, too,
> > as far as I can see. It uses bdrv_filter_default_perms(), and that does
> > seem to share PERM_WRITE.
>
> And bdrv_filter_default_perms() does "*nshared = *nshared |
> BLK_PERM_WRITE"
> only for child_file, which is the target. The source is child_backing.
>
> >
> >>>
> >>>> + }
> >>>> ret = backup_do_cow(s, offset, s->cluster_size,
> >>>> - &error_is_read, false);
> >>>> + &error_is_read);
> >>>> }
> >>>> if (ret < 0) {
> >>>> /* Depending on error action, fail now or retry
> >>>> cluster */
> >>>> @@ -516,17 +497,34 @@ static int coroutine_fn backup_run(Job *job, Error
> >>>> **errp)
> >>>> break;
> >>>> } else {
> >>>> offset -= s->cluster_size;
> >>>> + retry = true;
> >>>> continue;
> >>>> }
> >>>> }
> >>>> }
> >>>> + if (lock) {
> >>>> + bdrv_co_unlock(lock);
> >>>> + lock = NULL;
> >>>> + }
> >>>> + if (ret == 0 && !job_is_cancelled(job) &&
> >>>> + hbitmap_count(s->copy_bitmap))
> >>>> + {
> >>>> + /*
> >>>> + * we may have skipped some clusters, which were handled by
> >>>> + * backup-top, but failed and finished by returning error to
> >>>> + * the guest and set dirty bit back.
> >>>
> >>> So it's a matter of a race?
> >>>
> >>>> + */
> >>>> + goto iteration;
> >>>> + }
> >>>
> >>> Why not wrap everything in a do {} while (ret == 0 && !job_is...)
> >>> instead? Because it would mean reindenting everything?
> >>
> >> Don't remember, but I assume yes. And this code is anyway "to be
> >> refactored";
> >> I want FULL/TOP/INCREMENTAL to all go through (mostly) the same code path.
> >
> > Hm, well, if you want to refactor it later anyway... But I don't like
> > gotos that go backwards very much, unless there is a good reason to have
> > them (and there isn't here).
>
>
> Ok, not a real problem. Let's go on with do-while.
>
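[Editor's note: a minimal sketch, not QEMU code, of the do/while shape agreed on above, replacing the backwards "goto iteration". The helpers below are illustrative stand-ins for the real job and copy-bitmap state.]

```c
#include <stdbool.h>

static int passes_done;
static int dirty_clusters = 3;  /* clusters initially marked for copying */

static bool job_is_cancelled(void)
{
    return false;
}

static unsigned long hbitmap_count(void)
{
    return dirty_clusters;
}

/* One full pass over the bitmap. Pretend the first pass races with a
 * failed backup-top write that sets one dirty bit back. */
static int backup_loop_pass(void)
{
    passes_done++;
    dirty_clusters = (passes_done < 2) ? 1 : 0;
    return 0;
}

/* Repeat while the pass succeeded, the job is still alive, and failed
 * CbW operations have set dirty bits back -- no backwards goto. */
int backup_run_loop(void)
{
    int ret;

    do {
        ret = backup_loop_pass();
    } while (ret == 0 && !job_is_cancelled() && hbitmap_count());

    return ret;
}
```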
>
> >
> >>>> }
> >>>>
> >>>> - notifier_with_return_remove(&s->before_write);
> >>>> + /* wait pending CBW operations in backup-top */
> >>>> + bdrv_drain(s->backup_top);
> >>>>
> >>>> - /* wait until pending backup_do_cow() calls have completed */
> >>>> - qemu_co_rwlock_wrlock(&s->flush_rwlock);
> >>>> - qemu_co_rwlock_unlock(&s->flush_rwlock);
> >>>> + backup_top_progress = bdrv_backup_top_progress(s->backup_top);
> >>>> + job_progress_update(job, ret + backup_top_progress -
> >>>
> >>> Why the "ret"?
> >>
> >> oops, it looks like a copy-paste bug ("ret" is reasonable in
> >> backup_do_cow())
> >>
> >>>
> >>>> + s->backup_top_progress);
> >>>> + s->backup_top_progress = backup_top_progress;
> >>>
> >>> So the backup-top progress is ignored during basically all of the block
> >>> job until it suddenly jumps to 100 % completion? That doesn't sound
> >>> ideal.
> >>>
> >>> Or did you mean to put this into the for () loop of MODE_TOP/MODE_FULL?
> >>> (And the while() loop of MODE_NONE)
> >>
> >>
> >> It is done in backup_do_cow(), so FULL and TOP are covered.
> >>
> >> But you are right that MODE_NONE seems to have a problem here. And
> >> just updating it
> >> in a while loop would not work, as I doubt that job_yield() will return
> >> before the job finishes
> >> or there is user interaction like pause/continue/cancel..
> >>
> >> So now, it looks better to call job_progress_update() from backup_top
> >> directly, and drop
> >> this hack.
> >
> > Hmmm... I don't think job_*() calls belong in backup_top. How about
> > adding a callback to bdrv_backup_top_append()?
>
> Ok for me at least as a temporary step.
>
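[Editor's note: a sketch of the callback idea suggested above: bdrv_backup_top_append() would take a progress callback plus an opaque pointer, so the filter reports copied bytes without calling job_*() itself. The signature and struct layout are hypothetical, not the real QEMU API.]

```c
#include <stdint.h>

typedef void (*ProgressCallback)(uint64_t bytes, void *opaque);

typedef struct BackupTop {
    ProgressCallback progress_cb;   /* set by bdrv_backup_top_append() */
    void *progress_opaque;
} BackupTop;

/* Filter side: called after each successful CbW copy operation. */
static void backup_top_report(BackupTop *s, uint64_t bytes)
{
    if (s->progress_cb) {
        s->progress_cb(bytes, s->progress_opaque);
    }
}

/* Job side: the opaque points at the job's own progress counter, so
 * no job_*() call is needed inside the filter. */
static void job_progress_cb(uint64_t bytes, void *opaque)
{
    uint64_t *total = opaque;
    *total += bytes;
}
```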
> >
> >>>>
> >>>> return ret;
> >>>> }
> >>>> @@ -563,6 +561,9 @@ BlockJob *backup_job_create(const char *job_id,
> >>>> BlockDriverState *bs,
> >>>> int ret;
> >>>> int64_t cluster_size;
> >>>> HBitmap *copy_bitmap = NULL;
> >>>> + BlockDriverState *backup_top = NULL;
> >>>> + uint64_t all_except_resize = BLK_PERM_CONSISTENT_READ |
> >>>> BLK_PERM_WRITE |
> >>>> + BLK_PERM_WRITE_UNCHANGED |
> >>>> BLK_PERM_GRAPH_MOD;
> >>>
> >>> BLK_PERM_ALL & ~BLK_PERM_RESIZE?
> >>>
> >>> [1] But we probably do not want to share either PERM_WRITE or
> >>> PERM_WRITE_UNCHANGED because during the duration of the backup,
> >>> everything should go through the backup-top filter (not sure about
> >>> PERM_WRITE_UNCHANGED right now). Or is that something that the filter
> >>> node should enforce in backup_top_child_perm()?
> >>
> >> It's not a shared perm of backup_top, it's a shared perm of the block
> >> job's common.blk, which is
> >> used only for "job->len is fixed, so we can't allow resize", so this part
> >> is not changed.
> >>
> >> So yes, the problem you mean by [1] is about backup_top_child_perm(),
> >> where we share PERM_WRITE.
> >> And as described by the comment, we must share this write perm, otherwise
> >> we break guest writes.
> >
> > For the target, yes, but the problem is sharing it on the source.
> >
> >> We share PERM_WRITE in backup_top to force its target child to share
> >> PERM_WRITE on its backing,
> >> as the backing of the target is the source.
> >>
> >> But again, we share PERM_WRITE only on the target, and it is shared in
> >> the current code too.
> >
> > I'm not so sure whether PERM_WRITE is shared only on the target.
>
> Only on target, as child_file is target.
>
> >
> >>>
> >>>>
> >>>> assert(bs);
> >>>> assert(target);
> >>>> @@ -655,25 +656,31 @@ BlockJob *backup_job_create(const char *job_id,
> >>>> BlockDriverState *bs,
> >>>>
> >>>> copy_bitmap = hbitmap_alloc(len, ctz32(cluster_size));
> >>>>
> >>>> - /* job->len is fixed, so we can't allow resize */
> >>>> - job = block_job_create(job_id, &backup_job_driver, txn, bs,
> >>>> - BLK_PERM_CONSISTENT_READ,
> >>>> - BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
> >>>> - BLK_PERM_WRITE_UNCHANGED |
> >>>> BLK_PERM_GRAPH_MOD,
> >>>> - speed, creation_flags, cb, opaque, errp);
> >>>> - if (!job) {
> >>>> + /*
> >>>> + * bdrv_get_device_name will not help to find device name starting
> >>>> from
> >>>> + * @bs after backup-top append,
> >>>
> >>> Why not? Since backup-top is appended, shouldn't all parents of @bs be
> >>> parents of @backup_top then? (Making bdrv_get_parent_name() return the
> >>> same result)
> >>
> >> bdrv_get_device_name() eventually goes through role->get_name, and only
> >> the root role has
> >> this handler. After the append we'll have the backing role instead of root.
> >
> > Ah, I see, I asked the wrong question.
> >
> > Why is block_job_create() called on bs and not on backup_top? mirror
> > calls it on mirror_top_bs.
>
> Good question. I don't exactly remember; maybe there were more troubles
> with
> permissions or something. So, I'll have to try it again..
>
> What is more beneficial?
>
> My current approach is that the job and the filter are two sibling users of
> the source node;
> they do the copying, and they are synchronized. And in this way, it is
> better to read from
> the source directly, to not create extra intersection between the job and
> the filter..
>
> On the other hand, if we read through the filter, we possibly should do the
> whole
> copy operation through the filter..
>
> What is the difference between a guest read and a backup-job read, from the
> filter's POV?
> I think:
>
> For a guest read, the filter MUST read (as we must handle the guest
> request), and then, if
> we don't have too many in-flight requests, the ram-cache is not full, etc.,
> we can handle
> the already-read data in terms of backup, i.e. copy it somewhere. Or we can
> drop it, if
> we can't handle it at the moment..
>
> For a job read, we even MAY not read, if our queues are full, postponing
> the job request.
>
> So
>
> Guest read: MUST read, MAY backup
> Job read: MAY read and backup
>
> So, reading through the filter offers the possibility of a common code path
> + native prioritization
> of copy operations. This of course will need more refactoring of backup,
> and may be done
> as a separate step, but definitely, I have to at least try to create the
> job above the filter.
>
> >
> >>>> so let's calculate job_id before.
> >>>> Do
> >>>> + * it in the same way like block_job_create
> >>>> + */
> >>>> + if (job_id == NULL && !(creation_flags & JOB_INTERNAL)) {
> >>>> + job_id = bdrv_get_device_name(bs);
> >>>> + }
> >>>> +
> >>>> + backup_top = bdrv_backup_top_append(bs, target, copy_bitmap, errp);
> >>>> + if (!backup_top) {
> >>>> goto error;
> >>>> }
> >>>>
> >>>> - /* The target must match the source in size, so no resize here
> >>>> either */
> >>>> - job->target = blk_new(BLK_PERM_WRITE,
> >>>> - BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE |
> >>>> - BLK_PERM_WRITE_UNCHANGED |
> >>>> BLK_PERM_GRAPH_MOD);
> >>>> - ret = blk_insert_bs(job->target, target, errp);
> >>>> - if (ret < 0) {
> >>>> + /* job->len is fixed, so we can't allow resize */
> >>>> + job = block_job_create(job_id, &backup_job_driver, txn, bs, 0,
> >>>
> >>> Is there a reason you dropped PERM_CONSISTENT_READ here?
> >>
> >> Because we don't use this blk for reads now; we read through the
> >> backup_top child.
> >
> > Makes sense.
> >
> >>>> + all_except_resize, speed, creation_flags,
> >>>> + cb, opaque, errp);
> >>>> + if (!job) {
> >>>> goto error;
> >>>> }
> >>>>
> >>>> + job->source = backup_top->backing;
> >>>> + job->target = ((BDRVBackupTopState *)backup_top->opaque)->target;
> >>>
> >>> This looks really ugly. I think as long as the block job performs
> >>> writes itself, it should use its own BlockBackend.
> >>
> >> They are not BlockBackends, they are BdrvChildren.
> >
> > Exactly, which is what I don't like. They are children of a node that
> > is implemented in a different file, it looks weird to use them here.
> >
> >> It was Kevin's idea to reuse filter's
> >> children in backup job:
> >>
> >> https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01017.html
> >
> > It's still ugly if backup_top is in a different file. Well, maybe just
> > to me.
> >
> >> Hmm, and this is also why I need PERM_WRITE in backup_top, to write to
> >> target.
> >>
> >>>
> >>> Alternatively, I think it would make sense for the backup-top filter to
> >>> offer functions that this job can use to issue writes to the target.
> >>> Then the backup job would no longer need direct access to the target as
> >>> a BdrvChild.
> >
> > So what would be the problem with this?
>
> I have no specific arguments against, I'll try. Kevin, do you have comments
> on this?
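[Editor's note: a sketch of Max's alternative suggested above: backup-top offers a function through which the job writes to the target, so the job never touches the target BdrvChild directly. Everything here is illustrative; the real code would issue bdrv_co_pwritev() on the filter's target child, while this sketch models the target as a plain buffer.]

```c
#include <stdint.h>
#include <string.h>

#define TARGET_SIZE 4096

typedef struct BackupTopState {
    uint8_t target[TARGET_SIZE];   /* stand-in for the target child */
} BackupTopState;

/* Filter-provided entry point for the job's copy writes. Returns 0 on
 * success, -1 if the request is out of bounds. The job links against
 * this instead of holding the target BdrvChild itself. */
int backup_top_co_write_target(BackupTopState *s, uint64_t offset,
                               const void *buf, size_t bytes)
{
    if (offset > TARGET_SIZE || bytes > TARGET_SIZE - offset) {
        return -1;
    }
    memcpy(s->target + offset, buf, bytes);
    return 0;
}
```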
I haven't really followed this thread recently because there was other
stuff that needed my attention and you seemed to have a good discussion
with Max. If you think my input would be important at this point, I can
try to read up on it tomorrow.
Kevin
Hmm. OK, I think I now have enough notes from Max to think about and to
experiment with, so you can safely skip it for now.