Re: [Qemu-devel] QCOW2 deduplication design


From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] QCOW2 deduplication design
Date: Wed, 9 Jan 2013 17:16:04 +0100

On Wed, Jan 9, 2013 at 4:24 PM, Benoît Canet <address@hidden> wrote:
> Here is a mail to open a discussion on QCOW2 deduplication design and
> performance.
>
> The current deduplication strategy is RAM based.
> One of the goals of the project is to plan and implement an alternative way
> to do the lookups from disk for bigger images.
>
>
> I will in the first section enumerate the disk overheads of the RAM based
> lookup strategy and then in the second section enumerate the additional
> costs of doing lookups in a disk based prefix b-tree.
>
> Comments and suggestions are welcome.
>
> I) RAM based lookups overhead
>
> The qcow2 read path is not modified by the deduplication patchset.
>
> Each cluster written gets its hash computed.
>
> Two GTrees are used to give access to the hashes: one indexed by hash and
> the other indexed by physical offset.

What is the GTree indexed by physical offset used for?
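
A minimal sketch of what such a pair of indexes could look like with GLib's
GTree; HASH_LEN, the struct layout and the helper names are assumptions for
illustration, not the patchset's actual code:

#include <glib.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN           32    /* e.g. a 256-bit content hash */
#define DEDUP_CLUSTER_SIZE 4096  /* the 4KB dedup granularity   */

typedef struct DedupEntry {
    uint8_t  hash[HASH_LEN];     /* hash of the cluster's contents         */
    uint64_t phys_offset;        /* physical offset of the cluster on disk */
    uint64_t refcount;           /* how many L2 entries point to it        */
} DedupEntry;

static gint hash_cmp(gconstpointer a, gconstpointer b)
{
    return memcmp(a, b, HASH_LEN);
}

static gint offset_cmp(gconstpointer a, gconstpointer b)
{
    const uint64_t *x = a, *y = b;
    return (*x > *y) - (*x < *y);
}

/* by_hash answers "have we already stored this content?";
 * by_offset answers "which hash lives at this physical cluster?" */
static GTree *by_hash;
static GTree *by_offset;

static void dedup_index_init(void)
{
    by_hash   = g_tree_new(hash_cmp);
    by_offset = g_tree_new(offset_cmp);
}

static void dedup_index_insert(DedupEntry *e)
{
    g_tree_insert(by_hash, e->hash, e);
    g_tree_insert(by_offset, &e->phys_offset, e);
}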

> I.0) unaligned write
>
> When a write is unaligned or smaller than a 4KB cluster the deduplication
> code issues one or two reads to get the missing data required to build a
> 4KB*n linear buffer.
> The deduplication metrics code shows that this situation doesn't happen with
> virtio and ext3 as the guest file system.

If the application uses O_DIRECT inside the guest you may see <4 KB
requests even on ext3 guest file systems.  But in the buffered I/O
case the file system will use 4 KB blocks or similar.
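
A rough sketch of the read-modify-write step described in I.0, assuming a
4 KB dedup cluster size; read_clusters() and the buffer handling are
hypothetical, for illustration only:

#include <stdint.h>
#include <string.h>

#define DEDUP_CLUSTER_SIZE 4096

/* hypothetical helper: read `len` bytes at `offset` into `buf` */
void read_clusters(uint64_t offset, uint64_t len, uint8_t *buf);

/* Build a cluster-aligned linear buffer around an unaligned guest write.
 * `out` must be large enough for the whole aligned range. */
static void build_aligned_buffer(uint64_t offset, uint64_t len,
                                 const uint8_t *data, uint8_t *out,
                                 uint64_t *aligned_offset,
                                 uint64_t *aligned_len)
{
    uint64_t start = offset & ~(uint64_t)(DEDUP_CLUSTER_SIZE - 1);
    uint64_t end   = (offset + len + DEDUP_CLUSTER_SIZE - 1) &
                     ~(uint64_t)(DEDUP_CLUSTER_SIZE - 1);

    *aligned_offset = start;
    *aligned_len    = end - start;

    if (offset != start) {              /* missing head: first extra read  */
        read_clusters(start, offset - start, out);
    }
    memcpy(out + (offset - start), data, len);
    if (offset + len != end) {          /* missing tail: second extra read */
        read_clusters(offset + len, end - (offset + len),
                      out + (offset + len - start));
    }
}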

>
> I.1) First write overhead
>
> The hash is computed.
>
> The cluster is not duplicated so the hash is stored in a linked list.
>
> After that the writev call gets a new 64KB L2 dedup hash block corresponding
> to the physical sector of the written cluster.
> (This can be an allocating write requiring writing the offset of the new
> block in the dedup table and a flush.)
>
> The hash is written in the L2 dedup hash block and flushed later by the
> dedup_block_cache.
>
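A sketch of the I.1 path, reusing DedupEntry, dedup_index_insert() and the
defines from the earlier GTree sketch; SHA-256 via GChecksum stands in for
whatever hash the patchset actually uses, dedup_hash_block_store() is
hypothetical, and the intermediate linked list step is elided:

/* hypothetical helper: store the hash in the 64KB L2 dedup hash block that
 * covers this physical sector (may be an allocating write plus a flush) */
void dedup_hash_block_store(DedupEntry *e);

static void dedup_first_write(const uint8_t *cluster, uint64_t phys_offset)
{
    DedupEntry *e = g_new0(DedupEntry, 1);
    gsize len = HASH_LEN;
    GChecksum *cs = g_checksum_new(G_CHECKSUM_SHA256);

    /* the hash is computed over the 4KB cluster */
    g_checksum_update(cs, cluster, DEDUP_CLUSTER_SIZE);
    g_checksum_get_digest(cs, e->hash, &len);
    g_checksum_free(cs);

    e->phys_offset = phys_offset;
    e->refcount = 1;

    /* no match was found in by_hash, so remember the new hash in RAM ... */
    dedup_index_insert(e);

    /* ... and persist it so it is still known at the next startup */
    dedup_hash_block_store(e);
}
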
> I.2) Same cluster rewrite at the same place
>
> The hash is computed.
>
> qcow2_get_cluster_offset is called and the result is used to check that it
> is a rewrite.
>
> The cluster is counted as duplicated and not rewritten on disk.

This case is when identical data is rewritten in place?  No writes are
required - this is the scenario where online dedup is faster than
non-dedup because we avoid I/O entirely.
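
A small sketch of the I.2 check under the same assumptions as the sketches
above; qcow2_get_cluster_offset() is real qcow2 code, but wrapping its result
like this is only an illustration:

#include <stdbool.h>

/* `current_phys_offset` is what qcow2_get_cluster_offset() returned for the
 * logical cluster being written; `hash` is the freshly computed hash. */
static bool dedup_is_rewrite_in_place(const uint8_t *hash,
                                      uint64_t current_phys_offset)
{
    DedupEntry *e = g_tree_lookup(by_hash, hash);

    /* Identical data already lives exactly where this logical cluster
     * points, so nothing needs to be written at all. */
    return e && e->phys_offset == current_phys_offset;
}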

>
> I.3) First duplicated cluster write
>
> The hash is computed.
>
> qcow2_get_cluster_offset is called and we see that we are not rewriting the
> same cluster at the same place.
>
> I.3.a) The L2 entry of the first cluster written with this hash is
> overwritten to remove the QCOW_OFLAG_COPIED flag.
>
> I.3.b) The dedup hash block of the hash is overwritten to remember at the
> next startup that QCOW_OFLAG_COPIED has been cleared.
>
> A new L2 entry is created for this logical sector pointing to the physical
> cluster. (a potential allocating write)
>
> The refcount of the physical cluster is updated.
>
> I.4) Further writes of duplicated clusters
>
> Same as I.3 without I.3.a and I.3.b.
>
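A sketch covering I.3 and I.4 together, with invented helper names standing
in for the real qcow2 L2 table and refcount machinery:

/* hypothetical helpers, not actual qcow2 functions */
void l2_clear_copied_flag(uint64_t phys_offset);
void l2_map(uint64_t logical_offset, uint64_t phys_offset);
void refcount_increment(uint64_t phys_offset);
void dedup_hash_block_store(DedupEntry *e);

static void dedup_duplicate_write(DedupEntry *e, uint64_t logical_offset)
{
    if (e->refcount == 1) {
        /* I.3.a: the cluster becomes shared, so the original L2 entry
         * loses its QCOW_OFLAG_COPIED flag */
        l2_clear_copied_flag(e->phys_offset);
        /* I.3.b: record the cleared flag in the dedup hash block so it is
         * still known after a restart */
        dedup_hash_block_store(e);
    }

    /* common to I.3 and I.4: point the new logical cluster at the existing
     * physical one (possibly an allocating L2 write) and take a reference */
    l2_map(logical_offset, e->phys_offset);
    refcount_increment(e->phys_offset);
    e->refcount++;
}
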
> I.5) cluster removal
> When an L2 entry to a cluster becomes stale the qcow2 code decrements the
> refcount.
> When the refcount reaches zero the L2 hash block of the stale cluster
> is written to clear the hash.
> This happens often and requires the second GTree to find the hash by its
> physical sector number.

This happens often?  I'm surprised.  Thought this only happens when
you delete snapshots or resize the image file?  Maybe I misunderstood
this case.
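
A sketch of the I.5 path, showing where the offset-indexed GTree comes in;
dedup_hash_block_clear() is a hypothetical helper:

/* hypothetical helper: zero the hash in its on-disk L2 dedup hash block */
void dedup_hash_block_clear(DedupEntry *e);

/* called when the refcount of a physical cluster reaches zero */
static void dedup_cluster_removed(uint64_t phys_offset)
{
    /* only the physical offset is known here, hence the second tree */
    DedupEntry *e = g_tree_lookup(by_offset, &phys_offset);
    if (!e) {
        return;
    }
    dedup_hash_block_clear(e);
    g_tree_remove(by_hash, e->hash);
    g_tree_remove(by_offset, &e->phys_offset);
    g_free(e);
}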

> I.6) max refcount reached
> The L2 hash block of the cluster is written in order to remember at the next
> startup that it must not be used anymore for deduplication. The hash is
> dropped from the GTrees.

Interesting case.  This means you can no longer take snapshots
containing this cluster because we cannot track references :(.

Worst case: guest fills the disk with the same 4 KB data (e.g.
zeroes).  There is only a single data cluster but the refcount is
maxed out.  Now it is not possible to take a snapshot.
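
For a sense of scale, assuming the usual 16-bit qcow2 refcounts: a cluster's
refcount saturates at 65535, so 65535 * 4 KB, i.e. roughly 256 MB of identical
guest data (e.g. zeroes), is enough to push a single deduplicated cluster into
this case.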

Stefan


