Re: [PATCH v2 1/2] qcow2: Improve refcount structure rebuilding
From: Eric Blake
Subject: Re: [PATCH v2 1/2] qcow2: Improve refcount structure rebuilding
Date: Wed, 30 Mar 2022 09:32:06 -0500
User-agent: NeoMutt/20211029-524-5b0234
On Tue, Mar 29, 2022 at 11:19:16AM +0200, Hanna Reitz wrote:
> When rebuilding the refcount structures (when qemu-img check -r found
> errors with refcount = 0, but reference count > 0), the new refcount
> table defaults to being put at the image file end[1]. There is no good
> reason for that except that it means we will not have to rewrite any
> refblocks we already wrote to disk.
>
> Changing the code to rewrite those refblocks is not too difficult,
> though, so let us do that. That is beneficial for images on block
> devices, where we cannot really write beyond the end of the image file.
>
> Use this opportunity to add extensive comments to the code, and refactor
> it a bit, getting rid of the backwards-jumping goto.
>
> [1] Unless there is something allocated in the area pointed to by the
> last refblock, so we have to write that refblock. In that case, we
> try to put the reftable in there.
>
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1519071
> Closes: https://gitlab.com/qemu-project/qemu/-/issues/941
> Signed-off-by: Hanna Reitz <hreitz@redhat.com>
> ---
> block/qcow2-refcount.c | 332 +++++++++++++++++++++++++++++------------
> 1 file changed, 235 insertions(+), 97 deletions(-)
>
> +static int rebuild_refcounts_write_refblocks(
> + BlockDriverState *bs, void **refcount_table, int64_t *nb_clusters,
> + int64_t first_cluster, int64_t end_cluster,
> + uint64_t **on_disk_reftable_ptr, uint32_t *on_disk_reftable_entries_ptr
> + )
As you are rewriting this into a helper function, should this function
take Error **errp,...
> + /* Don't allocate a cluster in a refblock already written to disk */
> + if (first_free_cluster < refblock_start) {
> + first_free_cluster = refblock_start;
> + }
> + refblock_offset = alloc_clusters_imrt(bs, 1, refcount_table,
> + nb_clusters,
> + &first_free_cluster);
> + if (refblock_offset < 0) {
> + fprintf(stderr, "ERROR allocating refblock: %s\n",
> + strerror(-refblock_offset));
> + return refblock_offset;
...instead of using fprintf(stderr), where the caller then handles the
error by printing?
Could be done as a separate patch.
>
> + /* Refblock is allocated, write it to disk */
> +
> ret = qcow2_pre_write_overlap_check(bs, 0, refblock_offset,
> s->cluster_size, false);
> if (ret < 0) {
> fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
> - goto fail;
> + return ret;
> }
Another spot where errp conversion might improve the code.
>
> - /* The size of *refcount_table is always cluster-aligned, therefore
> - * the write operation will not overflow */
> + /*
> + * The refblock is simply a slice of *refcount_table.
> + * Note that the size of *refcount_table is always aligned to
> + * whole clusters, so the write operation will not result in
> + * out-of-bounds accesses.
> + */
> on_disk_refblock = (void *)((char *) *refcount_table +
> refblock_index * s->cluster_size);
>
> @@ -2550,23 +2579,99 @@ write_refblocks:
> s->cluster_size);
> if (ret < 0) {
> fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
> - goto fail;
> + return ret;
> }
and another
>
> - /* Go to the end of this refblock */
> + /* This refblock is done, skip to its end */
> cluster = refblock_start + s->refcount_block_size - 1;
> }
>
> - if (reftable_offset < 0) {
> - uint64_t post_refblock_start, reftable_clusters;
> + return reftable_grown;
> +}
The helper function looks okay.
> +
> +/*
> + * Creates a new refcount structure based solely on the in-memory information
> + * given through *refcount_table (this in-memory information is basically
> + * just the concatenation of all refblocks). All necessary allocations will
> + * be reflected in that array.
> + *
> + * On success, the old refcount structure is leaked (it will be covered by
> + * the new refcount structure).
> + */
> +static int rebuild_refcount_structure(BlockDriverState *bs,
> + BdrvCheckResult *res,
> + void **refcount_table,
> + int64_t *nb_clusters)
> +{
> + BDRVQcow2State *s = bs->opaque;
> + int64_t reftable_offset = -1;
> + int64_t reftable_length = 0;
> + int64_t reftable_clusters;
> + int64_t refblock_index;
> + uint32_t on_disk_reftable_entries = 0;
> + uint64_t *on_disk_reftable = NULL;
> + int ret = 0;
> + int reftable_size_changed = 0;
> + struct {
> + uint64_t reftable_offset;
> + uint32_t reftable_clusters;
> + } QEMU_PACKED reftable_offset_and_clusters;
> +
> + qcow2_cache_empty(bs, s->refcount_block_cache);
> +
> + /*
> + * For each refblock containing entries, we try to allocate a
> + * cluster (in the in-memory refcount table) and write its offset
> + * into on_disk_reftable[]. We then write the whole refblock to
> + * disk (as a slice of the in-memory refcount table).
> + * This is done by rebuild_refcounts_write_refblocks().
> + *
> + * Once we have scanned all clusters, we try to find space for the
> + * reftable. This will dirty the in-memory refcount table (i.e.
> + * make it differ from the refblocks we have already written), so we
> + * need to run rebuild_refcounts_write_refblocks() again for the
> + * range of clusters where the reftable has been allocated.
> + *
> + * This second run might make the reftable grow again, in which case
> + * we will need to allocate another space for it, which is why we
> + * repeat all this until the reftable stops growing.
> + *
> + * (This loop will terminate, because with every cluster the
> + * reftable grows, it can accommodate many more refcounts,
> + * so that at some point this must be able to cover the reftable
> + * and all refblocks describing it.)
> + *
> + * We then convert the reftable to big-endian and write it to disk.
> + *
> + * Note that we never free any reftable allocations. Doing so would
> + * needlessly complicate the algorithm: The eventual second check
> + * run we do will clean up all leaks we have caused.
> + */
Freeing reftable allocations from the first run might allow the second
(or third) to find a spot earlier in the image that is large enough to
contain the resized reftable, compared to leaving it leaked and
forcing subsequent runs to look later into the image. But I agree
that the complication of code needed to handle that is not worth the
minor corner-case savings of a more densely packed overall image (the
leaked clusters will probably remain sparse for any decent cluster
size).
Another approach might be to intentionally over-allocate the reftable
to the point where we know it cannot grow, allocate the refblocks, and
then scale it back down to how much we actually used (possibly
reclaiming a few clusters at the end of the allocation). But that's an
even bigger rewrite, and again, not worth the headache.
512-byte clusters would be where this is most likely to be noticeable
(that is, hitting a 3rd iteration with 512-byte clusters is probably
easy enough to actually test, but hitting a 3rd iteration with 2M
clusters is probably prohibitively expensive if even possible).
> +
> + reftable_size_changed =
> + rebuild_refcounts_write_refblocks(bs, refcount_table, nb_clusters,
> + 0, *nb_clusters,
> + &on_disk_reftable,
> + &on_disk_reftable_entries);
> + if (reftable_size_changed < 0) {
> + res->check_errors++;
> + ret = reftable_size_changed;
> + goto fail;
> + }
> +
> + /*
> + * There was no reftable before, so rebuild_refcounts_write_refblocks()
> + * must have increased its size (from 0 to something).
> + */
> + assert(reftable_size_changed == true);
'int == bool'. Works, but is a bit odd. I might have done just
assert(reftable_size_changed), since we just ruled out negative values
above. Particularly since...
> +
> + do {
> + int64_t reftable_start_cluster, reftable_end_cluster;
> + int64_t first_free_cluster = 0;
...
> +
> + /*
> + * If the reftable size has changed, we will need to find a new
> + * allocation, repeating the loop.
> + */
> + } while (reftable_size_changed);
...here you ARE using the int as a bool directly.
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org