From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas
Date: Wed, 22 May 2019 22:40:02 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

On 16.05.19 16:27, Anton Nefedov wrote:
> If COW areas of the newly allocated clusters are zeroes on the backing
> image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
> used on the whole cluster instead of writing explicit zero buffers later
> in perform_cow().
> 
> iotest 060:
> write to the discarded cluster does not trigger COW anymore.
> Use a backing image instead.
> 
> Signed-off-by: Anton Nefedov <address@hidden>
> ---
>  qapi/block-core.json       |  4 +-
>  block/qcow2.h              |  6 +++
>  block/qcow2-cluster.c      |  2 +-
>  block/qcow2.c              | 93 +++++++++++++++++++++++++++++++++++++-
>  block/trace-events         |  1 +
>  tests/qemu-iotests/060     |  7 ++-
>  tests/qemu-iotests/060.out |  5 +-
>  7 files changed, 112 insertions(+), 6 deletions(-)

[...]
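(As a concrete illustration of the case the patch targets -- hypothetical
file names, assuming the backing file has no data at the target cluster:

$ qemu-img create -f qcow2 base.qcow2 1G
$ qemu-img create -f qcow2 -b base.qcow2 top.qcow2
$ qemu-io -c 'write 4k 4k' top.qcow2

With the default 64k cluster size, this unaligned 4k write allocates one
cluster in top.qcow2 and would normally write explicit zero buffers for
the COW head [0, 4k) and tail [8k, 64k); with the patch, the whole
cluster is zeroed up front by a single BDRV_REQ_NO_FALLBACK request.)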

> diff --git a/block/qcow2.c b/block/qcow2.c
> index 8e024007db..e6b1293ddf 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c

[...]

> @@ -2145,6 +2150,80 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>      return false;
>  }
>  
> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
> +{
> +    int64_t nr;
> +    return !bytes ||
> +        (!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);

It's a pity that this bdrv_is_allocated_above() call throws away the
BDRV_BLOCK_ZERO information.  If something in the backing chain has
explicit zero clusters there, we could get that information for free,
but this check will consider them allocated.

Wouldn't it make sense to make bdrv_co_block_status_above() public and
then use that (with want_zero == false)?

(But that can be done later, too, of course.)
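For illustration, roughly what I have in mind -- assuming the currently
static bdrv_co_block_status_above() from block/io.c were exported with
its present signature; the helper name here is made up:

static bool is_zero_above(BlockDriverState *bs, int64_t offset, int64_t bytes)
{
    while (bytes > 0) {
        int64_t pnum;
        int ret = bdrv_co_block_status_above(bs, NULL, false /* want_zero */,
                                             offset, bytes, &pnum, NULL, NULL);
        if (ret < 0) {
            return false;
        }
        /*
         * Unallocated through the whole chain (reads as zeroes) and
         * explicit zero clusters both qualify; actual data does not.
         */
        if ((ret & BDRV_BLOCK_ALLOCATED) && !(ret & BDRV_BLOCK_ZERO)) {
            return false;
        }
        offset += pnum;
        bytes -= pnum;
    }
    return true;
}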

> +}
> +
> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
> +{
> +    /*
> +     * This check is designed as an optimization shortcut, so it must be
> +     * efficient.
> +     * Instead of is_zero(), use is_unallocated() as it is faster (but not
> +     * as accurate and can result in false negatives).
> +     */
> +    return is_unallocated(bs, m->offset + m->cow_start.offset,
> +                          m->cow_start.nb_bytes) &&
> +           is_unallocated(bs, m->offset + m->cow_end.offset,
> +                          m->cow_end.nb_bytes);
> +}
> +
> +static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    QCowL2Meta *m;
> +
> +    if (!(s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK)) {
> +        return 0;
> +    }
> +
> +    if (bs->encrypted) {
> +        return 0;
> +    }
> +
> +    for (m = l2meta; m != NULL; m = m->next) {
> +        int ret;
> +
> +        if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
> +            continue;
> +        }
> +
> +        if (!is_zero_cow(bs, m)) {
> +            continue;
> +        }
> +
> +        /*
> +         * instead of writing zero COW buffers,
> +         * efficiently zero out the whole clusters
> +         */
> +
> +        ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
> +                                            m->nb_clusters * s->cluster_size,
> +                                            true);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
> +        ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
> +                                    m->nb_clusters * s->cluster_size,
> +                                    BDRV_REQ_NO_FALLBACK);

Mostly I'm writing this mail because of this.  Zeroing the whole range
seems inefficient to me for very large requests, and the commit message
doesn't say anything about the reasoning behind unconditionally zeroing
everything.

Benchmarking suggests that in most cases it is about equal to zeroing
head and tail separately.  But if I pre-fill an image with random data,
it is much slower:

$ qemu-img create -f qcow2 foo.qcow2 10G
$ dd if=/dev/urandom of=foo.qcow2 bs=1M seek=1 count=4096
$ sync

Then doing large (400 MiB) unaligned write requests on it:

$ ./qemu-img bench -w -t none -s $((400 * 1048576)) \
  -S $((401 * 1048576)) -o 32768 -c 10 -d 2 -f qcow2 foo.qcow2

When zeroing the whole range, this usually takes around 25 s for me;
just zeroing head and tail takes around 14 s. Without this patch, it
takes around 14 s, too.

On the other hand, when doing smaller requests on a single cluster
(which is what this patch is for):

$ ./qemu-img bench -w -t none -s 4096 -S 65536 -o 32768 \
  -f qcow2 foo.qcow2

This takes around 26 s before this patch, 12 s with it, and something
like 30 to 40 s when zeroing head and tail separately.


Now, such large requests surely are the exception, as is allocating
blocks in an area of the image that already contains data.  However, I
just want to ask whether zeroing the whole range unconditionally is
done on purpose.  Maybe it makes sense to split head and tail if they
are more than, I don't know, 16 MB apart (see the sketch below)?  But
the "I don't know" is the problem, of course.  Is there a way to pick a
good default?
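For concreteness, a rough sketch of the split variant I mean, written
against the QCowL2Meta fields used in the patch; the helper and the
threshold are mine, not something I am proposing verbatim:

/*
 * Hypothetical alternative, not part of the patch: zero head and tail
 * separately once they are far apart, instead of always zeroing the
 * whole allocated range.  The 16 MiB threshold is an arbitrary
 * placeholder.  Caveat: with BDRV_REQ_NO_FALLBACK, sub-cluster ranges
 * may fail with -ENOTSUP if they do not match the driver's zeroing
 * alignment.
 */
static int zero_cow_regions(BDRVQcow2State *s, QCowL2Meta *m)
{
    const int64_t split_threshold = 16 * 1024 * 1024;
    int64_t gap = m->cow_end.offset -
                  (m->cow_start.offset + m->cow_start.nb_bytes);
    int ret;

    if (gap < split_threshold) {
        /* Small allocation: one zeroing request for everything */
        return bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
                                     m->nb_clusters * s->cluster_size,
                                     BDRV_REQ_NO_FALLBACK);
    }

    /* Large allocation: zero only the two COW regions */
    ret = bdrv_co_pwrite_zeroes(s->data_file,
                                m->alloc_offset + m->cow_start.offset,
                                m->cow_start.nb_bytes,
                                BDRV_REQ_NO_FALLBACK);
    if (ret < 0) {
        return ret;
    }
    return bdrv_co_pwrite_zeroes(s->data_file,
                                 m->alloc_offset + m->cow_end.offset,
                                 m->cow_end.nb_bytes,
                                 BDRV_REQ_NO_FALLBACK);
}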


(Note that I wrote a lot, but it's not like I'm making a case for
stopping this patch.  I just want to have asked about this before I
take it.)

Max

> +        if (ret < 0) {
> +            if (ret != -ENOTSUP && ret != -EAGAIN) {
> +                return ret;
> +            }
> +            continue;
> +        }
> +
> +        trace_qcow2_skip_cow(qemu_coroutine_self(), m->offset, m->nb_clusters);
> +        m->skip_cow = true;
> +    }
> +    return 0;
> +}
> +
>  static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
>                                           uint64_t bytes, QEMUIOVector *qiov,
>                                           int flags)
