qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functi


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-devel] [PATCH v4 02/11] block: Filtered children access functions
Date: Fri, 24 May 2019 08:12:50 +0000

23.05.2019 20:27, Max Reitz wrote:
> On 17.05.19 16:50, Vladimir Sementsov-Ogievskiy wrote:
>> 10.04.2019 23:20, Max Reitz wrote:
>>> What bs->file and bs->backing mean depends on the node.  For filter
>>> nodes, both signify a node that will eventually receive all R/W
>>> accesses.  For format nodes, bs->file contains metadata and data, and
>>> bs->backing will not receive writes -- instead, writes are COWed to
>>> bs->file.  Usually.
>>>
>>> In any case, it is not trivial to guess what a child means exactly with
>>> our currently limited form of expression.  It is better to introduce
>>> some functions that actually guarantee a meaning:
>>>
>>> - bdrv_filtered_cow_child() will return the child that receives requests
>>>     filtered through COW.  That is, reads may or may not be forwarded
>>>     (depending on the overlay's allocation status), but writes never go to
>>>     this child.
>>>
>>> - bdrv_filtered_rw_child() will return the child that receives requests
>>>     filtered through some very plain process.  Reads and writes issued to
>>>     the parent will go to the child as well (although timing, etc. may be
>>>     modified).
>>>
>>> - All drivers but quorum (but quorum is pretty opaque to the general
>>>     block layer anyway) always only have one of these children: All read
>>>     requests must be served from the filtered_rw_child (if it exists), so
>>>     if there was a filtered_cow_child in addition, it would not receive
>>>     any requests at all.
>>>     (The closest here is mirror, where all requests are passed on to the
>>>     source, but with write-blocking, write requests are "COWed" to the
>>>     target.  But that just means that the target is a special child that
>>>     cannot be introspected by the generic block layer functions, and that
>>>     source is a filtered_rw_child.)
>>>     Therefore, we can also add bdrv_filtered_child() which returns that
>>>     one child (or NULL, if there is no filtered child).
>>>
>>> Also, many places in the current block layer should be skipping filters
>>> (all filters or just the ones added implicitly, it depends) when going
>>> through a block node chain.  They do not do that currently, but this
>>> patch makes them.
>>>
>>> One example for this is qemu-img map, which should skip filters and only
>>> look at the COW elements in the graph.  The change to iotest 204's
>>> reference output shows how using blkdebug on top of a COW node used to
>>> make qemu-img map disregard the rest of the backing chain, but with this
>>> patch, the allocation in the base image is reported correctly.
>>>
>>> Furthermore, a note should be made that sometimes we do want to access
>>> bs->backing directly.  This is whenever the operation in question is not
>>> about accessing the COW child, but the "backing" child, be it COW or
>>> not.  This is the case in functions such as bdrv_open_backing_file() or
>>> whenever we have to deal with the special behavior of @backing as a
>>> blockdev option, which is that it does not default to null like all
>>> other child references do.
>>>
>>> Finally, the query functions (query-block and query-named-block-nodes)
>>> are modified to return any filtered child under "backing", not just
>>> bs->backing or COW children.  This is so that filters do not interrupt
>>> the reported backing chain.  This changes the output of iotest 184, as
>>> the throttled node now appears as a backing child.
>>>
>>> Signed-off-by: Max Reitz <address@hidden>
>>> ---
>>
>> [..]
>>
>>> --- a/block/mirror.c
>>> +++ b/block/mirror.c
> 
> [...]
> 
>>> @@ -1650,7 +1651,9 @@ static void mirror_start_job(const char *job_id, 
>>> BlockDriverState *bs,
>>>         * any jobs in them must be blocked */
>>>        if (target_is_backing) {
>>>            BlockDriverState *iter;
>>> -        for (iter = backing_bs(bs); iter != target; iter = 
>>> backing_bs(iter)) {
>>> +        for (iter = bdrv_filtered_bs(bs); iter != target;
>>
>> should it be filtered_target too?
> 
> Hmm...  The comment says that all nodes that disappear must be blocked.
>   I don’t even know by heart which nodes I let disappear. :-/
> 
> I suppose we should start at the first explicit node, filter or not...?

Hm, I thought about where should we stop. But I don't think we want to remove 
nodes
under target, so it should be OK as is..

> 
>>> +             iter = bdrv_filtered_bs(iter))
>>> +        {
>>>                /* XXX BLK_PERM_WRITE needs to be allowed so we don't block
>>>                 * ourselves at s->base (if writes are blocked for a node, 
>>> they are
>>>                 * also blocked for its backing file). The other options 
>>> would be a
> 
> [...]
> 
>>> @@ -1707,14 +1710,14 @@ void mirror_start(const char *job_id, 
>>> BlockDriverState *bs,
>>>                      MirrorCopyMode copy_mode, Error **errp)
>>>    {
>>>        bool is_none_mode;
>>> -    BlockDriverState *base;
>>> +    BlockDriverState *base = NULL;
>>
>> dead assignment
> 
> Now I wonder why I even have that.  Probably an artifact from some
> intermediate point.
> 
>>>    
>>>        if (mode == MIRROR_SYNC_MODE_INCREMENTAL) {
>>>            error_setg(errp, "Sync mode 'incremental' not supported");
>>>            return;
>>>        }
>>>        is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
>>> -    base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL;
>>> +    base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : 
>>> NULL;
>>>        mirror_start_job(job_id, bs, creation_flags, target, replaces,
>>>                         speed, granularity, buf_size, backing_mode,
>>>                         on_source_error, on_target_error, unmap, NULL, NULL,
>>> diff --git a/block/qapi.c b/block/qapi.c
>>> index 110d05dc57..478c6f5e0d 100644
>>> --- a/block/qapi.c
>>> +++ b/block/qapi.c
> 
> [...]
> 
>>> @@ -535,9 +538,10 @@ static BlockStats 
>>> *bdrv_query_bds_stats(BlockDriverState *bs,
>>>            s->parent = bdrv_query_bds_stats(bs->file->bs, blk_level);
>>>        }
>>>    
>>> -    if (blk_level && bs->backing) {
>>> +    cow_bs = bdrv_filtered_cow_bs(bs);
>>
>> So, if we at blk_level and top bs is explicit filter, you don't want to show 
>> it's
>> child?
> 
> I do.  It’s in s->parent.  I thought it makes sense to change the
> existing bs->file vs. bs->backing to storage vs. COW.
Hmm. When I reviewed this I didn't consider the following patch. Actually, in 
this patch
you break showing backing child for filter, and in following - fix it. Not very 
good,
but not a reason to merge these two patches.. Ok for me.

> 
>> Hmm, at least, we can't show it if it is file-child, as qapi filed already 
>> called
>> backing. So, if we can't show for file-child-based filters, it may be better 
>> to not
>> show filter children here at all.
>>
>>> +    if (blk_level && cow_bs) {
>>>            s->has_backing = true;
>>> -        s->backing = bdrv_query_bds_stats(bs->backing->bs, blk_level);
>>> +        s->backing = bdrv_query_bds_stats(cow_bs, blk_level);
>>>        }
>>>    
>>>        return s;
>>> diff --git a/block/stream.c b/block/stream.c
>>> index bfaebb861a..23d5c890e0 100644
>>> --- a/block/stream.c
>>> +++ b/block/stream.c
>>> @@ -65,6 +65,7 @@ static int stream_prepare(Job *job)
>>>        StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
>>>        BlockJob *bjob = &s->common;
>>>        BlockDriverState *bs = blk_bs(bjob->blk);
>>> +    BlockDriverState *unfiltered = bdrv_skip_rw_filters(bs);
>>
>> Aha, I'd call it filtered, but unfiltered is correct too, it's amazing
> 
> Haha :-)
> 
> I think it’s all rather insane than amazing, but, well, insanity never
> ceases to amaze, does it.
> 
>>>        BlockDriverState *base = s->base;
>>>        Error *local_err = NULL;
>>>        int ret = 0;
>>> @@ -72,7 +73,7 @@ static int stream_prepare(Job *job)
> 
> [...]
> 
>>> @@ -121,7 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error 
>>> **errp)
>>>        int64_t n = 0; /* bytes */
>>>        void *buf;
>>>    
>>> -    if (!bs->backing) {
>>> +    if (!bdrv_filtered_child(bs)) {
>>>            goto out;
>>>        }
>>
>> this condition checks that there is nothing to stream, so, I thing it's 
>> better to check
>> if (!bdrv_backing_chain_next(bs)) {
>>     goto out;
>> }
> 
> Ah, sure.
> 
>>> @@ -162,7 +163,7 @@ static int coroutine_fn stream_run(Job *job, Error 
>>> **errp)
>>>            } else if (ret >= 0) {
>>>                /* Copy if allocated in the intermediate images.  Limit to 
>>> the
>>>                 * known-unallocated area [offset, 
>>> offset+n*BDRV_SECTOR_SIZE).  */
>>> -            ret = bdrv_is_allocated_above(backing_bs(bs), base,
>>> +            ret = bdrv_is_allocated_above(bdrv_filtered_bs(bs), base,
>>>                                              offset, n, &n);
>>
>> Hmm, if we trying to support bs to be filter, and actually operate on 
>> first-non-filter,
>> as you write in qapi spec, this is wrong. Again it should be
>> bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs))..
> 
> Would bdrv_backing_chain_next() fulfill the same purpose?  It can’t be
> allocated in a filter node, after all.

It's OK too.

> 
>> Or, may be better, we at stream start should calculate reald top bs to 
>> operate on, and
>> forget about all filters above.. i.e., do bs = bdrv_skip_rw_filters(bs) at 
>> the very
>> beginning, when creating a job.
> 
> Sounds reasonable.  We can ignore all the filters on top of the
> (un)filtered top anyway.
> 
>>>                /* Finish early if end of backing file has been reached */
>>> @@ -268,7 +269,9 @@ void stream_start(const char *job_id, BlockDriverState 
>>> *bs,
>>>         * disappear from the chain after this operation. The streaming job 
>>> reads
>>>         * every block only once, assuming that it doesn't change, so block 
>>> writes
>>>         * and resizes. */
>>> -    for (iter = backing_bs(bs); iter && iter != base; iter = 
>>> backing_bs(iter)) {
>>> +    for (iter = bdrv_filtered_bs(bs); iter && iter != base;
>>> +         iter = bdrv_filtered_bs(iter))
>>> +    {
>>>            block_job_add_bdrv(&s->common, "intermediate node", iter, 0,
>>>                               BLK_PERM_CONSISTENT_READ | 
>>> BLK_PERM_WRITE_UNCHANGED,
>>>                               &error_abort);
>>> diff --git a/blockdev.c b/blockdev.c
>>> index 4775a07d93..bb71b8368d 100644
>>> --- a/blockdev.c
>>> +++ b/blockdev.c
>>> @@ -1094,7 +1094,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict)
>>>                return;
>>>            }
>>>    
>>> -        bs = blk_bs(blk);
>>> +        bs = bdrv_skip_implicit_filters(blk_bs(blk));
>>>            aio_context = bdrv_get_aio_context(bs);
>>>            aio_context_acquire(aio_context);
>>>    
>>> @@ -1663,7 +1663,7 @@ static void external_snapshot_prepare(BlkActionState 
>>> *common,
>>>            goto out;
>>>        }
>>>    
>>> -    if (state->new_bs->backing != NULL) {
>>> +    if (bdrv_filtered_cow_child(state->new_bs)) {
>>
>> Do we allow to create filter snapshot? We should either restrict it 
>> explicitly or
>> check bdrv_filtered_child here.. And we can't allow file-based-filters 
>> anyway..
> 
> Hm, yes, we should probably check both (separately to give better error
> messages).
> 
> In theory it might be possible to allow filters on top, but there isn’t
> really any point.  If someone wants to add filters on top of the
> snapshot, they should use reopen.
> 
>> [skipped up to the end of blockdev.c, I'm tired o_O]
> 
> I can very much relate. :-)
> 
> Your review definitely is much appreciated.
> 
>>> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
>>> index d1bb863cb6..f99f753fba 100644
>>> --- a/migration/block-dirty-bitmap.c
>>> +++ b/migration/block-dirty-bitmap.c
>>> @@ -285,9 +285,7 @@ static int init_dirty_bitmap_migration(void)
>>>            const char *drive_name = bdrv_get_device_or_node_name(bs);
>>>    
>>>            /* skip automatically inserted nodes */
>>> -        while (bs && bs->drv && bs->implicit) {
>>> -            bs = backing_bs(bs);
>>> -        }
>>> +        bs = bdrv_skip_implicit_filters(bs);
>>
>> this intersects with Jonh's patch
>> [PATCH v2] migration/dirty-bitmaps: change bitmap enumeration method
>> https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03340.html
> 
> Well.  I’m not really considerate of other patches with this series.
> Rebasing is always such a pain that I just write it for the current
> master.  I won’t incorporate unmerged series because doing so may cause
> me to have to rebase more than once.
> 
> And I can’t get this series merged soon enough because it’s just wrong
> that I (and you) have to the one(s) thinking about how to treat filters
> everywhere.  It should be the people that introduce the code.

I think it's just impossible to think over all filter use-cases now. Better
to stop at some point, and then fix wrong things while covering real use-cases
by io-tests.

> 
>>>            for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap;
>>>                 bitmap = bdrv_dirty_bitmap_next(bs, bitmap))
>>> diff --git a/nbd/server.c b/nbd/server.c
>>> index e21bd501dc..e41ae89dbe 100644
>>> --- a/nbd/server.c
>>> +++ b/nbd/server.c
>>> @@ -1506,13 +1506,13 @@ NBDExport *nbd_export_new(BlockDriverState *bs, 
>>> uint64_t dev_offset,
>>>        if (bitmap) {
>>>            BdrvDirtyBitmap *bm = NULL;
>>>    
>>> -        while (true) {
>>> +        while (bs) {
>>>                bm = bdrv_find_dirty_bitmap(bs, bitmap);
>>> -            if (bm != NULL || bs->backing == NULL) {
>>> +            if (bm != NULL) {
>>>                    break;
>>>                }
>>>    
>>> -            bs = bs->backing->bs;
>>> +            bs = bdrv_filtered_bs(bs);
>>>            }
>>
>> Check in documentation: "@bitmap: Also export the dirty bitmap reachable 
>> from @device".
>>
>> "Reachable" is not bad, but we may want to clarify that extended backing 
>> chain is meant
> 
> Hm...  Isn’t that just a problem with the current documentation?
> 

Yes. Anyway, it's not necessary to fix it in theses series.

> I think this change in code better fits what I’d guess from “reachable”
> than what it currently means.
> 
>>>            if (bm == NULL) {
>>> diff --git a/qemu-img.c b/qemu-img.c
>>> index aa6f81f1ea..bcfbb743fc 100644
>>> --- a/qemu-img.c
>>> +++ b/qemu-img.c
> 
> [...]
> 
>>> @@ -2434,7 +2433,8 @@ static int img_convert(int argc, char **argv)
>>>             * s.target_backing_sectors has to be negative, which it will
>>>             * be automatically).  The backing file length is used only
>>>             * for optimizations, so such a case is not fatal. */
>>> -        s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs);
>>> +        s.target_backing_sectors =
>>> +            bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs));
>>
>> can't out_bs be filter itself?
> 
> why would you do that
> 
> More serious, well, perhaps, in theory.  In practice I really cannot
> imagine why it would be.

Throttling? But it has file child and will not work with backing anyway. And
we don't have public backing-based filters anyway. So, I don't care, it's OK.

> 
>>
>>>        } else {
>>>            s.target_backing_sectors = -1;
>>>        }
>>> @@ -2797,6 +2797,7 @@ static int get_block_status(BlockDriverState *bs, 
>>> int64_t offset,
>>>    
>>>        depth = 0;
>>>        for (;;) {
>>> +        bs = bdrv_skip_rw_filters(bs);
>>
>> Why? Filters may have own implementation of block_status, why to skip it?
>>
>> Or, thay cannot? Really, may be disallow filters have block_status, we may 
>> solve
>> inefficient block_status_above we talked about before.
> 
> As said in the other subthread, I think ignoring filters here is fine.
> 
> Max
> 
>>>            ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
>>>            if (ret < 0) {
>>>                return ret;
> 


-- 
Best regards,
Vladimir

reply via email to

[Prev in Thread] Current Thread [Next in Thread]