[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH V3] block/rbd: implement bdrv_co_block_status
From: |
Ilya Dryomov |
Subject: |
Re: [PATCH V3] block/rbd: implement bdrv_co_block_status |
Date: |
Thu, 6 Jan 2022 17:01:33 +0100 |
On Thu, Jan 6, 2022 at 4:27 PM Peter Lieven <pl@kamp.de> wrote:
>
> Am 05.10.21 um 10:36 schrieb Ilya Dryomov:
> > On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven <pl@kamp.de> wrote:
> >> Am 05.10.21 um 09:54 schrieb Ilya Dryomov:
> >>> On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven <pl@kamp.de> wrote:
> >>>> the qemu rbd driver currently lacks support for bdrv_co_block_status.
> >>>> This results mainly in incorrect progress during block operations (e.g.
> >>>> qemu-img convert with an rbd image as source).
> >>>>
> >>>> This patch utilizes the rbd_diff_iterate2 call from librbd to detect
> >>>> allocated and unallocated (all zero areas).
> >>>>
> >>>> To avoid querying the ceph OSDs for the answer this is only done if
> >>>> the image has the fast-diff feature which depends on the object-map and
> >>>> exclusive-lock features. In this case it is guaranteed that the
> >>>> information
> >>>> is present in memory in the librbd client and thus very fast.
> >>>>
> >>>> If fast-diff is not available all areas are reported to be allocated
> >>>> which is the current behaviour if bdrv_co_block_status is not
> >>>> implemented.
> >>>>
> >>>> Signed-off-by: Peter Lieven <pl@kamp.de>
> >>>> ---
> >>>> V2->V3:
> >>>> - check rbd_flags every time (they can change during runtime) [Ilya]
> >>>> - also check for fast-diff invalid flag [Ilya]
> >>>> - *map and *file cant be NULL [Ilya]
> >>>> - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
> >>>> unallocated area [Ilya]
> >>>> - typo: catched -> caught [Ilya]
> >>>> - changed wording about fast-diff, object-map and exclusive lock in
> >>>> commit msg [Ilya]
> >>>>
> >>>> V1->V2:
> >>>> - add commit comment [Stefano]
> >>>> - use failed_post_open [Stefano]
> >>>> - remove redundant assert [Stefano]
> >>>> - add macro+comment for the magic -9000 value [Stefano]
> >>>> - always set *file if its non NULL [Stefano]
> >>>>
> >>>> block/rbd.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>> 1 file changed, 126 insertions(+)
> >>>>
> >>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>> index dcf82b15b8..3cb24f9981 100644
> >>>> --- a/block/rbd.c
> >>>> +++ b/block/rbd.c
> >>>> @@ -1259,6 +1259,131 @@ static ImageInfoSpecific
> >>>> *qemu_rbd_get_specific_info(BlockDriverState *bs,
> >>>> return spec_info;
> >>>> }
> >>>>
> >>>> +typedef struct rbd_diff_req {
> >>>> + uint64_t offs;
> >>>> + uint64_t bytes;
> >>>> + int exists;
> >>> Hi Peter,
> >>>
> >>> Nit: make exists a bool. The one in the callback has to be an int
> >>> because of the callback signature but let's not spread that.
> >>>
> >>>> +} rbd_diff_req;
> >>>> +
> >>>> +/*
> >>>> + * rbd_diff_iterate2 allows to interrupt the exection by returning a
> >>>> negative
> >>>> + * value in the callback routine. Choose a value that does not conflict
> >>>> with
> >>>> + * an existing exitcode and return it if we want to prematurely stop the
> >>>> + * execution because we detected a change in the allocation status.
> >>>> + */
> >>>> +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
> >>>> +
> >>>> +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
> >>>> + int exists, void *opaque)
> >>>> +{
> >>>> + struct rbd_diff_req *req = opaque;
> >>>> +
> >>>> + assert(req->offs + req->bytes <= offs);
> >>>> +
> >>>> + if (req->exists && offs > req->offs + req->bytes) {
> >>>> + /*
> >>>> + * we started in an allocated area and jumped over an
> >>>> unallocated area,
> >>>> + * req->bytes contains the length of the allocated area before
> >>>> the
> >>>> + * unallocated area. stop further processing.
> >>>> + */
> >>>> + return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >>>> + }
> >>>> + if (req->exists && !exists) {
> >>>> + /*
> >>>> + * we started in an allocated area and reached a hole.
> >>>> req->bytes
> >>>> + * contains the length of the allocated area before the hole.
> >>>> + * stop further processing.
> >>>> + */
> >>>> + return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >>> Do you have a test case for when this branch is taken?
> >>
> >> That would happen if you diff from a snapshot, the question is if it can
> >> also happen if the image is a clone from a snapshot?
> >>
> >>
> >>>> + }
> >>>> + if (!req->exists && exists && offs > req->offs) {
> >>>> + /*
> >>>> + * we started in an unallocated area and hit the first allocated
> >>>> + * block. req->bytes must be set to the length of the
> >>>> unallocated area
> >>>> + * before the allocated area. stop further processing.
> >>>> + */
> >>>> + req->bytes = offs - req->offs;
> >>>> + return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >>>> + }
> >>>> +
> >>>> + /*
> >>>> + * assert that we caught all cases above and allocation state has
> >>>> not
> >>>> + * changed during callbacks.
> >>>> + */
> >>>> + assert(exists == req->exists || !req->bytes);
> >>>> + req->exists = exists;
> >>>> +
> >>>> + /*
> >>>> + * assert that we either return an unallocated block or have got
> >>>> callbacks
> >>>> + * for all allocated blocks present.
> >>>> + */
> >>>> + assert(!req->exists || offs == req->offs + req->bytes);
> >>>> + req->bytes = offs + len - req->offs;
> >>>> +
> >>>> + return 0;
> >>>> +}
> >>>> +
> >>>> +static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
> >>>> + bool want_zero,
> >>>> int64_t offset,
> >>>> + int64_t bytes, int64_t
> >>>> *pnum,
> >>>> + int64_t *map,
> >>>> + BlockDriverState
> >>>> **file)
> >>>> +{
> >>>> + BDRVRBDState *s = bs->opaque;
> >>>> + int ret, r;
> >>> Nit: I would rename ret to status or something like that to make
> >>> it clear(er) that it is an actual value and never an error. Or,
> >>> even better, drop it entirely and return one of the two bitmasks
> >>> directly.
> >>>
> >>>> + struct rbd_diff_req req = { .offs = offset };
> >>>> + uint64_t features, flags;
> >>>> +
> >>>> + assert(offset + bytes <= s->image_size);
> >>>> +
> >>>> + /* default to all sectors allocated */
> >>>> + ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> >>>> + *map = offset;
> >>>> + *file = bs;
> >>>> + *pnum = bytes;
> >>>> +
> >>>> + /* check if RBD image supports fast-diff */
> >>>> + r = rbd_get_features(s->image, &features);
> >>>> + if (r < 0) {
> >>>> + goto out;
> >>>> + }
> >>>> + if (!(features & RBD_FEATURE_FAST_DIFF)) {
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> + /* check if RBD fast-diff result is valid */
> >>>> + r = rbd_get_flags(s->image, &flags);
> >>>> + if (r < 0) {
> >>>> + goto out;
> >>>> + }
> >>>> + if (flags & RBD_FLAG_FAST_DIFF_INVALID) {
> >>>> + goto out;
> >>> Nit: I'm not a fan of labels that cover just the return statement.
> >>> Feel free to ignore this one but I would get rid of it and replace
> >>> these gotos with returns.
> >>
> >> That would be return with the bitmask directly coded in if I also
> >>
> >> drop the ret variable. I can change that, no problem.
> >>
> >>
> >>>> + }
> >>>> +
> >>>> + r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true,
> >>>> + qemu_rbd_co_block_status_cb, &req);
> >>>> + if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) {
> >>>> + goto out;
> >>>> + }
> >>>> + assert(req.bytes <= bytes);
> >>>> + if (!req.exists) {
> >>>> + if (r == 0 && !req.bytes) {
> >>>> + /*
> >>>> + * rbd_diff_iterate2 does not invoke callbacks for
> >>>> unallocated areas
> >>>> + * except for the case where an overlay has a hole where
> >>>> the parent
> >>>> + * has not. This here catches the case where no callback was
> >>>> + * invoked at all.
> >>>> + */
> >>> The above is true in the case of diffing against a snapshot, i.e. when
> >>> the "from" snapshot has some data where the "to" revision (whether HEAD
> >>> or another snapshot) has a hole. I don't think it is true for child vs
> >>> parent (but it has been a while since I looked at the diff code). As
> >>> long as NULL is passed for fromsnapname, I would expect the callback to
> >>> be invoked only for allocated areas. If I'm right, we could simplify
> >>> qemu_rbd_co_block_status_cb() considerably.
> >> See my comment in the callback. Can you look at the diff code or give
> >> me at least a pointer where I can find it. I am not that familiar with
> >> the librbd code yet.
> > I assumed that you added !exists handling because it came up in your
> > testing. If you don't have a test case then let's proceed under the
> > assumption that it doesn't happen for clones.
>
>
> Hi Ilya,
>
>
> it seems that our assumption was false. I have an image which shows holes
> without diffing against
>
> a snapshot. Do you have an idea why?
>
>
> $ rbd --conf /etc/ceph/ceph-dev01.conf --id lieven info
> dhp-standard/c4ca7ee9-36ce-4fc9-9d3b-ece8a4f8b83e/c1ad11d0-4f6a-4cc1-8aa3-ff3c413c1471.raw
> rbd image 'c1ad11d0-4f6a-4cc1-8aa3-ff3c413c1471.raw':
> size 20 GiB in 20000 objects
> order 20 (1 MiB objects)
> snapshot_count: 2
> id: 3d6daa102e4d9f
> block_name_prefix: rbd_data.3d6daa102e4d9f
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> op_features:
> flags:
> create_timestamp: Tue Sep 21 14:16:56 2021
> access_timestamp: Thu Jan 6 15:24:46 2022
> modify_timestamp: Thu Jan 6 15:45:42 2022
>
>
> $ rbd --conf /etc/ceph/ceph-dev01.conf --id lieven snap ls
> dhp-standard/c4ca7ee9-36ce-4fc9-9d3b-ece8a4f8b83e/c1ad11d0-4f6a-4cc1-8aa3-ff3c413c1471.raw
> SNAPID NAME SIZE PROTECTED TIMESTAMP
> 12297 dlp-20210921-144509 20 GiB Tue Sep 21 14:45:09 2021
> 17745 dlp-20220106-040000 20 GiB Thu Jan 6 04:00:01 2022
>
>
> $ rbd --conf /etc/ceph/ceph-dev01.conf --id lieven diff --whole-object
> dhp-standard/c4ca7ee9-36ce-4fc9-9d3b-ece8a4f8b83e/c1ad11d0-4f6a-4cc1-8aa3-ff3c413c1471.raw
> | grep zero
> 204472320 1048576 zero
> 1114636288 1048576 zero
> 1115684864 1048576 zero
> 1116733440 1048576 zero
> 1117782016 1048576 zero
> 1218445312 1048576 zero
> 1219493888 1048576 zero
> 1220542464 1048576 zero
Hi Peter,
Yes, I stumbled upon this just yesterday while looking into another
librbd issue surfaced by this patch [1] and was going to email you and
the list after wrapping my head around this behavior. I see where it
is coming from but I'm not sure what the right fix is. I would prefer
to patch librbd but that may turn out to be not feasible. Let me get
back on this next week.
[1] https://tracker.ceph.com/issues/53784
Thanks,
Ilya