qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v5 1/3] qcow2: Add qcow2_shrink_l1_and_l2_table


From: Jun Li
Subject: Re: [Qemu-devel] [PATCH v5 1/3] qcow2: Add qcow2_shrink_l1_and_l2_table for qcow2 shrinking
Date: Sat, 3 Jan 2015 20:23:12 +0800
User-agent: Mutt/1.5.23 (2014-03-12)

On Fri, 11/21 11:56, Max Reitz wrote:
> On 2014-10-26 at 16:20, Jun Li wrote:
> >This patch is the realization of new function qcow2_shrink_l1_and_l2_table.
> >This function will shrink/discard l1 and l2 table when do qcow2 shrinking.
> >
> >Signed-off-by: Jun Li <address@hidden>
> >---
> >v5:
> >   Do some modifications based on MAX's suggestion. Thanks for MAX.
> >   In v5, do l2 shrinking firstly, then do l1 shrinking in function 
> > qcow2_shrink_l1_and_l2_table. As do l1 shrinking need to allocate some 
> > clusters for new l1 table, so in v5 it can re-use the freed clusters come 
> > from l2 shrinking.
> >
> >v4:
> >   Add deal with COW clusters in l2 table. When using COW, some of (l2_entry 
> > >>
> >s->cluster_bits) will larger than s->refcount_table_size, so need to discard
> >this l2_entry.
> >
> >v3:
> >   Fixed host cluster leak.
> >---
> >  block/qcow2-cluster.c | 182 
> > ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  block/qcow2.c         |  37 +++++++++-
> >  block/qcow2.h         |   2 +
> >  3 files changed, 218 insertions(+), 3 deletions(-)
> >
> >diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> >index 4d888c7..28d2d62 100644
> >--- a/block/qcow2-cluster.c
> >+++ b/block/qcow2-cluster.c
> >@@ -29,6 +29,9 @@
> >  #include "block/qcow2.h"
> >  #include "trace.h"
> >+static int l2_load(BlockDriverState *bs, uint64_t l2_offset,
> >+                   uint64_t **l2_table);
> >+
> >  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
> >                          bool exact_size)
> >  {
> >@@ -135,6 +138,185 @@ int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t 
> >min_size,
> >      return ret;
> >  }
> >+int qcow2_shrink_l1_and_l2_table(BlockDriverState *bs, uint64_t new_l1_size,
> >+                                 int new_l2_index, int64_t boundary_size)
> >+{
> >+    BDRVQcowState *s = bs->opaque;
> >+    int new_l1_size2, ret, i;
> 
> gcc gives me a warning (which is made an error by -Werror) here due to
> uninitialized use of "ret". "ret" is not necessarily set in the while ()
> loop and there are some "goto fail"s afterwards which do not set it.
> 
> >+    uint64_t *new_l1_table;
> >+    int64_t new_l1_table_offset;
> >+    int64_t old_l1_table_offset, old_l1_size;
> >+    uint8_t data[12];
> >+    uint64_t l2_offset;
> >+    uint64_t *l2_table, l2_entry;
> >+    int64_t l2_free_entry; /* The entry of l2 table need to free from */
> >+    uint64_t *old_l1_table = s->l1_table;
> >+    int num = s->l1_size - new_l1_size;
> >+
> >+    assert(new_l1_size <= s->l1_size);
> >+    while ((num >= -1) && (s->l1_size + num - 1 >= 0)) {
> >+        l2_free_entry = 0;
> >+        l2_offset = old_l1_table[s->l1_size + num - 1] & L1E_OFFSET_MASK;
> >+
> >+        if (l2_offset == 0) {
> >+            goto retry;
> >+        }
> >+
> >+        if (num == 0) {
> >+            if (new_l2_index == 0) {
> >+                goto retry;
> >+            }
> >+            l2_free_entry = new_l2_index;
> >+        }
> >+
> >+        /* load l2_table into cache */
> >+        ret = l2_load(bs, l2_offset, &l2_table);
> >+
> >+        if (ret < 0) {
> >+            return ret;
> >+        }
> >+
> >+        for (i = s->l2_size - 1; i >= 0; i--) {
> >+            l2_entry = be64_to_cpu(l2_table[i]);
> 
> & L2E_OFFSET_MASK is missing. OFLAG_COPIED will break this without.
> 
> >+
> >+            /* Due to COW, the clusters in l2 table will
> >+             * not in sequential order, so there will be
> >+             * some l2_entry >= boundary_size when perform shrinking.
> >+             */
> >+            if (num == -1) {
> >+                if (l2_entry >= boundary_size) {
> 
> boundary_size is set to "offset" by the caller. "offset" is the guest disk
> size. l2_entry (or should be) a host offset. They are not comparable.
> 
> >+                    goto free_cluster;
> >+                } else {
> >+                    continue;
> >+                }
> >+            }
> >+
> >+            /* Deal with COW clusters in l2 table when num == 0 */
> >+            if (i <= l2_free_entry - 1) {
> >+                if (l2_entry >= boundary_size) {
> >+                    goto free_cluster;
> >+                }
> >+                continue;
> >+            }
> >+
> >+            switch (qcow2_get_cluster_type(l2_entry)) {
> >+            case QCOW2_CLUSTER_UNALLOCATED:
> >+                if (!bs->backing_hd) {
> >+                    continue;
> >+                }
> >+                break;
> >+
> >+            case QCOW2_CLUSTER_ZERO:
> >+                continue;
> >+
> >+            case QCOW2_CLUSTER_NORMAL:
> >+            case QCOW2_CLUSTER_COMPRESSED:
> >+                break;
> >+
> >+            default:
> >+                abort();
> >+            }
> >+
> >+        free_cluster:
> >+            qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
> >+
> >+            if (s->qcow_version >= 3) {
> >+                l2_table[i] = cpu_to_be64(QCOW_OFLAG_ZERO);
> >+            } else {
> >+                l2_table[i] = cpu_to_be64(0);
> >+            }
> >+
> >+            /* Then decrease the refcount */
> >+            qcow2_free_any_clusters(bs, l2_entry, 1, QCOW2_DISCARD_MAX);
> >+        }
> >+
> >+        ret = qcow2_cache_put(bs, s->l2_table_cache, (void **) &l2_table);
> >+        if (ret < 0) {
> >+            return ret;
> >+        }
> >+        if (l2_free_entry == 0 && num != -1) {
> >+            qemu_vfree(l2_table);
> >+            qcow2_free_clusters(bs, l2_offset, s->cluster_size - 1,
> >+                                QCOW2_DISCARD_OTHER);
> 
> You should probably flush the L2 cache before you do that.
> 
> >+        }
> >+    retry:
> >+        num--;
> >+    }
> >+
> >+    new_l1_size2 = sizeof(uint64_t) * new_l1_size;
> >+    new_l1_table = qemu_try_blockalign(bs->file,
> >+                                       align_offset(new_l1_size2, 512));
> >+    if (new_l1_table == NULL) {
> >+        return -ENOMEM;
> >+    }
> >+    memset(new_l1_table, 0, align_offset(new_l1_size2, 512));
> >+
> >+    /* shrinking l1 table */
> >+    memcpy(new_l1_table, s->l1_table, new_l1_size2);
> >+
> >+    /* write new table (align to cluster) */
> >+    new_l1_table_offset = qcow2_alloc_clusters(bs, new_l1_size2);
> >+
> >+    if ((new_l1_table_offset) >= boundary_size) {
> >+        goto fail;
> >+    }
> >+
> >+    ret = qcow2_cache_flush(bs, s->refcount_block_cache);
> >+    if (ret < 0) {
> >+        goto fail;
> >+    }
> 
> You don't need to flush the refblock cache.
> 
> >+
> >+    /* the L1 position has not yet been updated, so these clusters must
> >+     * indeed be completely free */
> >+    ret = qcow2_pre_write_overlap_check(bs, 0, new_l1_table_offset,
> >+                                        new_l1_size2);
> >+
> >+    if (ret < 0) {
> >+        goto fail;
> >+    }
> >+
> >+    BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE);
> >+
> >+    for (i = 0; i < new_l1_size; i++) {
> >+        new_l1_table[i] = cpu_to_be64(new_l1_table[i]);
> >+    }
> >+
> >+    ret = bdrv_pwrite_sync(bs->file, new_l1_table_offset,
> >+                           new_l1_table, new_l1_size2);
> >+    if (ret < 0) {
> >+        goto fail;
> >+    }
> >+
> >+    for (i = 0; i < new_l1_size; i++) {
> >+        new_l1_table[i] = be64_to_cpu(new_l1_table[i]);
> >+    }
> >+
> >+    /* set new table */
> >+    BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ACTIVATE_TABLE);
> >+    cpu_to_be32w((uint32_t *)data, new_l1_size);
> >+    stq_be_p(data + 4, new_l1_table_offset);
> >+    ret = bdrv_pwrite_sync(bs->file, offsetof(QCowHeader, l1_size),
> >+                           data, sizeof(data));
> >+    if (ret < 0) {
> >+        goto fail;
> >+    }
> 
> I understand your intent with writing a new L1 table. But I don't see the
> need. With 64 kB clusters, a 64 kB L1 table can serve an image with 4 TB
> guest size (8192 L2 tables, each having 8192 entries). I don't think we'll
> gain a lot from shrinking the L1 table. It's too much effort.
> 
> >+
> >+    qemu_vfree(s->l1_table);
> >+    old_l1_table_offset = s->l1_table_offset;
> >+    s->l1_table_offset = new_l1_table_offset;
> >+    s->l1_table = new_l1_table;
> >+    old_l1_size = s->l1_size;
> >+    s->l1_size = new_l1_size;
> >+    qcow2_free_clusters(bs, old_l1_table_offset, old_l1_size * 
> >sizeof(uint64_t),
> >+                        QCOW2_DISCARD_OTHER);
> >+    return 0;
> >+ fail:
> >+    qemu_vfree(new_l1_table);
> >+    qcow2_free_clusters(bs, new_l1_table_offset, new_l1_size2,
> >+                        QCOW2_DISCARD_OTHER);
> >+    return ret;
> >+}
> >+
> >  /*
> >   * l2_load
> >   *
> >diff --git a/block/qcow2.c b/block/qcow2.c
> >index d031515..d2b0dfe 100644
> >--- a/block/qcow2.c
> >+++ b/block/qcow2.c
> >@@ -2111,10 +2111,41 @@ static int qcow2_truncate(BlockDriverState *bs, 
> >int64_t offset)
> >          return -ENOTSUP;
> >      }
> >-    /* shrinking is currently not supported */
> >+    /* shrinking image */
> >      if (offset < bs->total_sectors * 512) {
> >-        error_report("qcow2 doesn't support shrinking images yet");
> >-        return -ENOTSUP;
> >+        /* As l1 table, l2 table, refcount table, refcount block table
> >+         * and file header of the qcow2 image need to use some clusters,
> >+         * so should subtract these metadata from offset.
> >+         */
> >+        int64_t nb_l1 = DIV_ROUND_UP((uint64_t)s->l1_size * 
> >sizeof(uint64_t),
> >+                                     s->cluster_size);
> >+        int64_t nb_l2 = DIV_ROUND_UP(offset, (uint64_t)s->l2_size <<
> >+                                     s->cluster_bits);
> >+        int64_t nb_refcount_block_table = DIV_ROUND_UP(offset, (uint64_t)
> >+                                                       s->cluster_size <<
> >+                                                       
> >s->refcount_block_bits);
> >+        int64_t nb_refcount_table = DIV_ROUND_UP(nb_refcount_block_table << 
> >3,
> >+                                                 s->cluster_size);
> >+        int64_t total_nb = 2 * nb_l2 + nb_l1 + nb_refcount_block_table +
> >+                           nb_refcount_table + 1;
> >+        int64_t offset_for_shrink = offset - (total_nb << s->cluster_bits);
> 
> Okay, what does total_nb have to do with offset?
> 
> total_nb is some host value. offset is a guest value. Don't mix them.
> 
> >+        int new_l2_index = offset_to_l2_index(s, offset_for_shrink);
> >+
> >+        new_l1_size = size_to_l1(s, offset_for_shrink);
> >+        ret = qcow2_shrink_l1_and_l2_table(bs, new_l1_size, new_l2_index,
> >+                                           offset);
> >+        if (ret < 0) {
> >+            return ret;
> >+        }
> >+
> >+        int64_t actual_size = bdrv_get_allocated_file_size(bs);
> >+
> >+        if (offset < actual_size) {
> >+            ret = bdrv_truncate(bs->file, offset);
> >+            if (ret < 0) {
> >+                return ret;
> >+            }
> >+        }
> 
> actual_size is not some offset you can use. It's the number of bytes
> allocated in the file. The file can contain holes. Therefore, comparing an
> offset against actual_size is wrong. Second, offset is the guest disk size,
> actual_size is the host disk size. They are not comparable.
> 
> Second, why are you truncating the file to offset if offset is smaller than
> actual_size? That is always wrong. You are blindly assuming:
> (1) that the image consists of only data and no metadata, because you're
> using the guest disk size (offset) to shrink the host file
> (2) that all that data is tightly packed at the beginning of the file

Yes, it exists above issues, thank you.

In my original thought, if do not truncate the file, will hit "virtual size <
disk size", so this looks uncomfortable. Based on this, I do bdrv_truncate
after qcow2_shrink_l1_and_l2_table. What's your opinion?

> (1) is obviously wrong. (2) is wrong as well, because clusters may be spread
> all over the host file. We would need defragmentation to solve that issue.
> 
> Therefore, to shorten the host file, you'd need to walk through the image
> and find the highest *host* cluster actually in use. Then you can truncate
> to after that cluster. However, I'd strongly advise against that because it
> doesn't bring any actual benefit.
> 
> Today's file systems support holes. Just discard all the clusters which get
> freed during shrinking, that will free up all that space. No need to
> defragment, no need to truncate. Yes, the file will look larger than it
> actually is, but nobody should care. We will need defragmentation to tackle
> that issue.
> 
> >      }
> >      new_l1_size = size_to_l1(s, offset);
> >diff --git a/block/qcow2.h b/block/qcow2.h
> >index 577ccd1..be1237d 100644
> >--- a/block/qcow2.h
> >+++ b/block/qcow2.h
> >@@ -516,6 +516,8 @@ int qcow2_pre_write_overlap_check(BlockDriverState *bs, 
> >int ign, int64_t offset,
> >  /* qcow2-cluster.c functions */
> >  int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
> >                          bool exact_size);
> >+int qcow2_shrink_l1_and_l2_table(BlockDriverState *bs, uint64_t new_l1_size,
> >+                                 int new_l2_index, int64_t boundary_size);
> >  int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
> >  void qcow2_l2_cache_reset(BlockDriverState *bs);
> >  int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t 
> > cluster_offset);
> 
> So, as for what I think we do need to do when shrinking (and keep in mind:
> The offset given to qcow2_truncate() is the guest size! NOT the host image
> size!):
> 
> (1) Determine the first L2 table and the first entry in the table which will
> lie beyond the new guest disk size.

Here is not correct always. Due to the COW, using offset to calculate the
first entry of the first L2 table will be incorrect.

What I have done for this scenario:
(1) if the first entry is the first entry of the L2 table, so will scan "the
previous L2 table"("the previous L2 table" location is in front of "L2 table"
in L1 table). If the entry of previous L2 table is larger than offset, will
discard this entry, too.
(2) If the first entry is not the first entry of the L2 table, still to scan
the whole L2 table to make sure no entry is beyond offset.

> (2) Discard all clusters beginning from there.
> (3) Discard all L2 tables which are then completely empty.
> (4) Update the header size.

For this patch current's realizion, have include above 4 steps I think.
Current patch, also have another step 5.
(5) truncate the file.

Here I think we also should add discard refcount table and refcount block
table when they are completely empty.

> 
> And that's it. We can either speed up step (2) by implementing it manually,
> or we just use bdrv_discard() on the qcow2 BDS (in the simplest case:
> bdrv_discard(bs, DIV_ROUND_UP(offset, BDRV_SECTOR_SIZE), bs->total_sectors -
> DIV_ROUND_UP(offset, BDRV_SECTOR_SIZE));.
> 
> We can incorporate step (3) by extending qcow2_discard_clusters() to free L2
> tables when they are empty after discard_single_l2(). But we don't even have
> to that now. It's an optimization we can go about later.
> 
> So, we can do (1), (2) and (3) in a single step: Just one bdrv_discard()
> call. But it's probably better to use qcow2_discard_clusters() instead and
> set the full_discard parameter to true.
> 
> So: qcow2_discard_clusters(bs, offset, bs->total_sectors - offset /
> BDRV_SECTOR_SIZE, true);. Then update the guest disk size field in the
> header. And we're done.
> 
> There are four problems with this approach:
> - qcow2_discard_clusters() might be slower than optimal. I personally don't
> care at all.
> - If "bs->total_sectors * BDRV_SECTOR_SIZE - offset" is greater than
> INT_MAX, this won't work. Trivially solvable by encapsulating the
> qcow2_discard_clusters() call in a loop which limits nb_clusters to INT_MAX
> / BDRV_SECTOR_SIZE.
> - The L1 table is not resized. Should not matter in practice at all.

Yes, agree with you.

> - The file is not truncated. Does not matter either (because holes in the
> file are good enough), and we can't actually solve this problem without
> defragmentation anyway.
> 
> There is one advantage:
> - It's extremely simple. It's literally below ten lines of code.
> 
> I think the advantage far outweighs the disadvantage. But I may be wrong.
> What do you think?

Hi max,

  Sorry for so late to reply as I am so busy recently. I think let's have an
agreement on how to realize qcow2 shrinking first, then type code is better.
Another issue, as gmail can not be used in current China, I have to use this
email to reply. :)

Regards,
Jun Li




reply via email to

[Prev in Thread] Current Thread [Next in Thread]