qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 1/2] qcow2: Improve refcount structure rebuilding


From: Eric Blake
Subject: Re: [PATCH v2 1/2] qcow2: Improve refcount structure rebuilding
Date: Wed, 30 Mar 2022 09:32:06 -0500
User-agent: NeoMutt/20211029-524-5b0234

On Tue, Mar 29, 2022 at 11:19:16AM +0200, Hanna Reitz wrote:
> When rebuilding the refcount structures (when qemu-img check -r found
> errors with refcount = 0, but reference count > 0), the new refcount
> table defaults to being put at the image file end[1].  There is no good
> reason for that except that it means we will not have to rewrite any
> refblocks we already wrote to disk.
> 
> Changing the code to rewrite those refblocks is not too difficult,
> though, so let us do that.  That is beneficial for images on block
> devices, where we cannot really write beyond the end of the image file.
> 
> Use this opportunity to add extensive comments to the code, and refactor
> it a bit, getting rid of the backwards-jumping goto.
> 
> [1] Unless there is something allocated in the area pointed to by the
>     last refblock, so we have to write that refblock.  In that case, we
>     try to put the reftable in there.
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1519071
> Closes: https://gitlab.com/qemu-project/qemu/-/issues/941
> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> ---
>  block/qcow2-refcount.c | 332 +++++++++++++++++++++++++++++------------
>  1 file changed, 235 insertions(+), 97 deletions(-)
> 

> +static int rebuild_refcounts_write_refblocks(
> +        BlockDriverState *bs, void **refcount_table, int64_t *nb_clusters,
> +        int64_t first_cluster, int64_t end_cluster,
> +        uint64_t **on_disk_reftable_ptr, uint32_t 
> *on_disk_reftable_entries_ptr
> +    )

As you are rewriting this into a helper function, should this function
take Error **errp,...

> +            /* Don't allocate a cluster in a refblock already written to 
> disk */
> +            if (first_free_cluster < refblock_start) {
> +                first_free_cluster = refblock_start;
> +            }
> +            refblock_offset = alloc_clusters_imrt(bs, 1, refcount_table,
> +                                                  nb_clusters,
> +                                                  &first_free_cluster);
> +            if (refblock_offset < 0) {
> +                fprintf(stderr, "ERROR allocating refblock: %s\n",
> +                        strerror(-refblock_offset));
> +                return refblock_offset;

...instead of using fprintf(stderr), where the caller then handles the
error by printing?

Could be done as a separate patch.

>  
> +        /* Refblock is allocated, write it to disk */
> +
>          ret = qcow2_pre_write_overlap_check(bs, 0, refblock_offset,
>                                              s->cluster_size, false);
>          if (ret < 0) {
>              fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
> -            goto fail;
> +            return ret;
>          }

Another spot where errp conversion might improve the code.

>  
> -        /* The size of *refcount_table is always cluster-aligned, therefore 
> the
> -         * write operation will not overflow */
> +        /*
> +         * The refblock is simply a slice of *refcount_table.
> +         * Note that the size of *refcount_table is always aligned to
> +         * whole clusters, so the write operation will not result in
> +         * out-of-bounds accesses.
> +         */
>          on_disk_refblock = (void *)((char *) *refcount_table +
>                                      refblock_index * s->cluster_size);
>  
> @@ -2550,23 +2579,99 @@ write_refblocks:
>                            s->cluster_size);
>          if (ret < 0) {
>              fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
> -            goto fail;
> +            return ret;
>          }

and another

>  
> -        /* Go to the end of this refblock */
> +        /* This refblock is done, skip to its end */
>          cluster = refblock_start + s->refcount_block_size - 1;
>      }
>  
> -    if (reftable_offset < 0) {
> -        uint64_t post_refblock_start, reftable_clusters;
> +    return reftable_grown;
> +}

The helper function looks okay.

> +
> +/*
> + * Creates a new refcount structure based solely on the in-memory information
> + * given through *refcount_table (this in-memory information is basically 
> just
> + * the concatenation of all refblocks).  All necessary allocations will be
> + * reflected in that array.
> + *
> + * On success, the old refcount structure is leaked (it will be covered by 
> the
> + * new refcount structure).
> + */
> +static int rebuild_refcount_structure(BlockDriverState *bs,
> +                                      BdrvCheckResult *res,
> +                                      void **refcount_table,
> +                                      int64_t *nb_clusters)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int64_t reftable_offset = -1;
> +    int64_t reftable_length = 0;
> +    int64_t reftable_clusters;
> +    int64_t refblock_index;
> +    uint32_t on_disk_reftable_entries = 0;
> +    uint64_t *on_disk_reftable = NULL;
> +    int ret = 0;
> +    int reftable_size_changed = 0;
> +    struct {
> +        uint64_t reftable_offset;
> +        uint32_t reftable_clusters;
> +    } QEMU_PACKED reftable_offset_and_clusters;
> +
> +    qcow2_cache_empty(bs, s->refcount_block_cache);
> +
> +    /*
> +     * For each refblock containing entries, we try to allocate a
> +     * cluster (in the in-memory refcount table) and write its offset
> +     * into on_disk_reftable[].  We then write the whole refblock to
> +     * disk (as a slice of the in-memory refcount table).
> +     * This is done by rebuild_refcounts_write_refblocks().
> +     *
> +     * Once we have scanned all clusters, we try to find space for the
> +     * reftable.  This will dirty the in-memory refcount table (i.e.
> +     * make it differ from the refblocks we have already written), so we
> +     * need to run rebuild_refcounts_write_refblocks() again for the
> +     * range of clusters where the reftable has been allocated.
> +     *
> +     * This second run might make the reftable grow again, in which case
> +     * we will need to allocate another space for it, which is why we
> +     * repeat all this until the reftable stops growing.
> +     *
> +     * (This loop will terminate, because with every cluster the
> +     * reftable grows, it can accomodate a multitude of more refcounts,
> +     * so that at some point this must be able to cover the reftable
> +     * and all refblocks describing it.)
> +     *
> +     * We then convert the reftable to big-endian and write it to disk.
> +     *
> +     * Note that we never free any reftable allocations.  Doing so would
> +     * needlessly complicate the algorithm: The eventual second check
> +     * run we do will clean up all leaks we have caused.
> +     */

Freeing reftable allocations from the first run might allow the second
(or third) to find a spot earlier in the image that is large enough to
contain the resized reftable, compared to leaving it leaked and
forcing subsequent runs to look later into the image.  But I agree
that the complication of code needed to handle that is not worth the
minor corner-case savings of a more densely packed overall image (the
leaked clusters will probably remain sparse for any decent cluster
size).

Another approach might be to intentionally over-allocate the reftable
to the point where we know it can't grow, then allocate, then scale it
back down to how much we actually used (possibly reclaiming a few
clusters at the end of the allocation).  But that's an even bigger
rewrite, and again, not worth the headache.

512-byte clusters would be where this is most likely to be noticeable
(that is, hitting a 3rd iteration with 512-byte clusters is probably
easy enough to actually test, but hitting a 3rd iteration with 2M
clusters is probably prohibitively expensive if even possible).

> +
> +    reftable_size_changed =
> +        rebuild_refcounts_write_refblocks(bs, refcount_table, nb_clusters,
> +                                          0, *nb_clusters,
> +                                          &on_disk_reftable,
> +                                          &on_disk_reftable_entries);
> +    if (reftable_size_changed < 0) {
> +        res->check_errors++;
> +        ret = reftable_size_changed;
> +        goto fail;
> +    }
> +
> +    /*
> +     * There was no reftable before, so rebuild_refcounts_write_refblocks()
> +     * must have increased its size (from 0 to something).
> +     */
> +    assert(reftable_size_changed == true);

'int == bool'.  Works, but is a bit odd.  I might have done just
assert(reftable_size_changed), since we just ruled out negative values
above.  Particularly since...

> +
> +    do {
> +        int64_t reftable_start_cluster, reftable_end_cluster;
> +        int64_t first_free_cluster = 0;
...
> +
> +        /*
> +         * If the reftable size has changed, we will need to find a new
> +         * allocation, repeating the loop.
> +         */
> +    } while (reftable_size_changed);

...here you ARE using the int as a bool directly.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




reply via email to

[Prev in Thread] Current Thread [Next in Thread]