
[Qemu-devel] QCOW2 deduplication


From: Benoît Canet
Subject: [Qemu-devel] QCOW2 deduplication
Date: Tue, 26 Feb 2013 18:14:44 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

Hello Kevin,

As you are the best person to discuss QCOW2 implementation issues with, I am
writing this mail so you know what has been done on deduplication and what I am
planning to do next.

In short, I need your feedback before going into another code sprint and
needing another code review.

Current status
--------------

Also see (http://lists.gnu.org/archive/html/qemu-devel/2013-02/msg01260.html)

The current prototype of QCOW2 deduplication uses 32-byte SHA256 or SKEIN
hashes to identify each 4KB cluster with a very low probability of collision.

The 4KB cluster size was chosen because it's the block size of most of the
filesystems that will run on guests.
The current prototype shows that very high deduplication ratios are achievable
with 4KB clusters.

The SKEIN implementation I am linking to runs at 400MB/s on a modern machine,
so it's not the bottleneck.
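
For illustration, here is a minimal, self-contained sketch of how a 4KB
cluster can be reduced to a 32-byte digest using GLib's GChecksum (QEMU
already links against GLib); the prototype itself uses its own SHA256/SKEIN
code and the names below are mine, not QEMU's:

/* Minimal sketch: hash one 4KB cluster into a 32-byte digest with GLib's
 * GChecksum.  Illustrative only; not the prototype's actual code. */
#include <glib.h>

#define DEDUP_CLUSTER_SIZE 4096
#define DEDUP_HASH_SIZE    32

static void dedup_hash_cluster(const guint8 *cluster, guint8 *hash_out)
{
    GChecksum *cs = g_checksum_new(G_CHECKSUM_SHA256);
    gsize len = DEDUP_HASH_SIZE;

    g_checksum_update(cs, cluster, DEDUP_CLUSTER_SIZE);
    g_checksum_get_digest(cs, hash_out, &len);
    g_checksum_free(cs);
}

int main(void)
{
    guint8 cluster[DEDUP_CLUSTER_SIZE] = { 0 };
    guint8 hash[DEDUP_HASH_SIZE];

    dedup_hash_cluster(cluster, hash);
    g_print("first hash byte: %02x\n", hash[0]);
    return 0;
}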

The deduplication code does not modify the read path.

The deduplication code is hooked into qcow2_co_writev in three places.

qcow2_dedup_read_missing_and_concatenate:
At this first hook, at most two reads are issued to build a buffer whose size
is a multiple of the cluster size.
These two reads cover the beginning and the end of the buffer.
In practice guest filesystems use 4KB clusters, so no unaligned write is done
and these two reads are skipped.
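
To make the alignment step concrete, here is a rough standalone sketch of the
arithmetic involved; the names and layout are hypothetical, not the actual
qcow2_dedup_read_missing_and_concatenate code:

/* Sketch of the alignment arithmetic behind the "read missing and
 * concatenate" step.  Names and types are hypothetical, not QEMU's. */
#include <stdint.h>
#include <stdio.h>

#define CLUSTER_SIZE 4096ULL

int main(void)
{
    uint64_t write_offset = 6144;   /* example: unaligned guest write */
    uint64_t write_size   = 2048;

    uint64_t start = write_offset & ~(CLUSTER_SIZE - 1);
    uint64_t end   = (write_offset + write_size + CLUSTER_SIZE - 1)
                     & ~(CLUSTER_SIZE - 1);

    /* Head read: fills the gap before the guest data, if any. */
    uint64_t head_read = write_offset - start;
    /* Tail read: fills the gap after the guest data, if any. */
    uint64_t tail_read = end - (write_offset + write_size);

    printf("buffer covers [%llu, %llu): head read %llu bytes, "
           "tail read %llu bytes\n",
           (unsigned long long)start, (unsigned long long)end,
           (unsigned long long)head_read, (unsigned long long)tail_read);
    /* An aligned 4KB write gives head_read == tail_read == 0, which is
     * the common case for guest filesystems using 4KB blocks. */
    return 0;
}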

qcow2_dedup:
As we discussed on IRC a few months ago, this function tries to deduplicate as
many consecutive clusters as it can.
It will then skip all the non-deduplicable clusters and locate the next cluster
that can be deduplicated.
This way qcow2_co_writev knows exactly how much data it has to write.
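
As a rough illustration of that loop (with a stubbed hash store lookup and
hypothetical names, not the prototype's actual code):

/* Sketch of the "deduplicate consecutive clusters" loop used to decide
 * how much data qcow2_co_writev still has to write. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HASH_SIZE 32

/* Stub standing in for a lookup in the hash store: pretend every
 * all-zero cluster is already known at physical cluster 42. */
static bool dedup_lookup_hash(const uint8_t *hash, uint64_t *physical)
{
    uint8_t zero[HASH_SIZE] = { 0 };
    if (memcmp(hash, zero, HASH_SIZE) == 0) {
        *physical = 42;
        return true;
    }
    return false;
}

/* Deduplicate as many leading clusters as possible; the return value
 * tells the caller at which cluster the real write must start. */
static size_t dedup_prefix(uint8_t hashes[][HASH_SIZE], size_t nb_clusters)
{
    size_t i;
    uint64_t physical;

    for (i = 0; i < nb_clusters; i++) {
        if (!dedup_lookup_hash(hashes[i], &physical)) {
            break;  /* first cluster that really needs to be written */
        }
        /* The real code would bump the refcount of the found physical
         * cluster and point the L2 entry at it instead of writing. */
    }
    return i;
}

int main(void)
{
    uint8_t hashes[3][HASH_SIZE] = { { 0 }, { 0 }, { 1 } };
    printf("deduplicated %zu leading clusters\n", dedup_prefix(hashes, 3));
    return 0;
}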

qcow2_dedup_store_new_hashes:
This function is called after the actual new data is written to disk, in
order to make the hashes persistent.

When qcow2_alloc_cluster_link_l2 frees a block, its hash is deleted from
disk.

When the refcount of a block exceeds 2^16/2, the hash of this block is tagged
and forgotten so there will be room to take qcow2 snapshots.
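
A minimal sketch of that refcount guard, assuming QCOW2's 16-bit refcounts;
the constant and helper names are illustrative only:

/* Once a deduplicated cluster reaches half of the 16-bit refcount space,
 * its hash is dropped from the store so snapshots (which also take
 * refcounts) still have headroom. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DEDUP_MAX_REFCOUNT (0x10000 / 2)   /* 2^16 / 2 */

static bool dedup_can_reuse_cluster(uint16_t refcount)
{
    return refcount < DEDUP_MAX_REFCOUNT;
}

int main(void)
{
    printf("refcount 100:   reuse=%d\n", dedup_can_reuse_cluster(100));
    printf("refcount 32768: reuse=%d\n", dedup_can_reuse_cluster(32768));
    return 0;
}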

Most of the code is done in qcow2-dedup.c

The current implementation stores the hashes in RAM in two GLib GTrees:
the first indexed by hash and the second indexed by cluster index.
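
As an illustration (structure and names are mine, not the prototype's exact
code), the two in-RAM indexes could look like this:

/* Two GTrees pointing at the same entry: one keyed by the 32-byte hash
 * (for write deduplication) and one keyed by the physical cluster index
 * (for hash removal when a cluster is freed). */
#include <glib.h>
#include <string.h>

#define HASH_SIZE 32

typedef struct {
    guint8  hash[HASH_SIZE];
    guint64 cluster_index;
} DedupEntry;

static gint hash_cmp(gconstpointer a, gconstpointer b)
{
    return memcmp(a, b, HASH_SIZE);
}

static gint cluster_cmp(gconstpointer a, gconstpointer b)
{
    guint64 x = *(const guint64 *)a, y = *(const guint64 *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    GTree *by_hash    = g_tree_new(hash_cmp);
    GTree *by_cluster = g_tree_new(cluster_cmp);

    DedupEntry *e = g_new0(DedupEntry, 1);
    e->cluster_index = 123;

    g_tree_insert(by_hash, e->hash, e);
    g_tree_insert(by_cluster, &e->cluster_index, e);

    g_print("lookup by cluster index: %p\n",
            (void *)g_tree_lookup(by_cluster, &e->cluster_index));
    return 0;
}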

On disk the hashes are stored in a sparse L1/L2 table indexed by physical
cluster sector.

This organization works well for images of a few GB, but as Stefan says it
will not be sufficient for TB-sized images because the GTrees consume too much
RAM.

Alternative on-disk data structures
-----------------------------------

I studied the classic data structures that could be used and even
prototyped one to see how it behaves.

The first class of data structures that could be used to store the hashes
is the B-tree family.
It's well described in textbooks but needs at most O(log(n)) 4KB reads to look
up a single hash.
The implementation seems a bit complex, and writing a new hash would also need
O(log(n)) I/Os.
This would make deduplication perform badly.
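
To give a rough idea of the B-tree cost, here is a back-of-the-envelope
estimate under assumptions of my own (1TB of data in 4KB clusters, 4KB B-tree
nodes holding on the order of a hundred 32-byte hashes each):

    n      = 1TB / 4KB                  = 2^28 ~ 268 million hashes
    fanout ~ 4KB node / ~40B per entry  ~ 100 entries per node
    depth  ~ log_100(2^28)              ~ 4 levels

So an uncached hash lookup would cost on the order of 4 random 4KB reads
(minus whatever upper levels stay cached in RAM), and inserting a new hash
would pay a similar number of reads plus at least one write.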

The second class of data structures that could be used is hash tables.
The promise is O(1) lookup and insertion.
The first shortcoming is that they don't grow well.
Extendible hashing (http://goo.gl/GiB7A) tries to address the growth issue.
The second shortcoming is that every new 4KB block stored on disk would generate
a new hash that would need 1 random 4KB read + 1 random 4KB write to be stored.
With SSD storage being used more and more and not being good at random writes,
this solution is less attractive.

I finally found this patent-free paper: SILT (http://goo.gl/AG493).

In a first step it stores new hashes in a log, making hash writes
sequential and fast at the cost of some RAM.
In a second step it reorganizes the last log into a hash table consuming 2 bytes
of RAM per hash.
Then in a third step it merges the second-step hash tables into a very efficient
store.

The full SILT implementation available on GitHub is 15k SLOC of C++.
I think that most of the complexity comes from the third step.

I am proposing to build the QCOW2 deduplication hash store on the first two
steps of SILT to get most of the benefits without the complexity.

The RAM requirement would be approx 4 bytes per 4KB of data stored on disk.

Each of the second-step hash tables would have a small hash filter stored in RAM.

These hash filters would be interleaved in order to make their traversal
cache-friendly.

A lookup would then linearly scan the hash filters at RAM speed
in order to know which hash table must be queried.
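
Here is a rough sketch of that filter scan under my own assumptions about the
layout (a flat 4-byte tag per stored hash, one filter per frozen hash table,
interleaving ignored); it is not SILT's nor the prototype's exact scheme:

/* Each frozen (second-step) hash table keeps a small in-RAM filter of
 * 4-byte tags; a lookup scans these tags linearly in RAM to decide
 * which on-disk table, if any, must be read. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HASH_SIZE 32

typedef struct {
    uint32_t *tags;      /* one 4-byte filter entry per hash in the table */
    size_t    nb_tags;
    int       table_id;  /* which on-disk hash table to read on a match */
} HashFilter;

static uint32_t hash_tag(const uint8_t *hash)
{
    uint32_t tag;
    memcpy(&tag, hash, sizeof(tag));   /* first 4 bytes of the 32-byte hash */
    return tag;
}

/* Returns the id of the first table whose filter may contain the hash,
 * or -1.  A false positive only costs one wasted 4KB read. */
static int filter_scan(const HashFilter *filters, size_t nb_filters,
                       const uint8_t *hash)
{
    uint32_t tag = hash_tag(hash);

    for (size_t f = 0; f < nb_filters; f++) {
        for (size_t i = 0; i < filters[f].nb_tags; i++) {
            if (filters[f].tags[i] == tag) {
                return filters[f].table_id;
            }
        }
    }
    return -1;
}

int main(void)
{
    uint32_t tags0[] = { 0x11111111, 0x22222222 };
    uint32_t tags1[] = { 0x33333333 };
    HashFilter filters[] = {
        { tags0, 2, 0 },
        { tags1, 1, 1 },
    };
    uint8_t hash[HASH_SIZE] = { 0x33, 0x33, 0x33, 0x33 };

    printf("candidate table: %d\n", filter_scan(filters, 2, hash));
    return 0;
}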

The cost of a hash lookup would be almost a single random read.
The cost of writing a hash is amortized as it's written to a log.

To keep the risk of false positives low as the hash filter list grows,
I would use 4-byte hash filter entries instead of 2-byte ones.
This would use 1GB of RAM for 1TB worth of stored hashes.
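
For reference, the arithmetic behind that figure:

    1TB / 4KB clusters      = 268,435,456 hashes
    268,435,456 x 4 bytes   = 1GB of hash filter entries in RAM

(the first-step log adds some RAM on top of this, as noted above).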

As the first two steps of SILT are the best data structures I found while
digging into this topic, could you give me your feelings about them?

Removal of the second lookup data structure
-------------------------------------------

Needing two on-disk data structures for deduplication is annoying because they
need to be synchronized, and that may require a third structure (a journal).

The second data structure is used to look up a hash by its physical cluster
sector.

It is used in a few places in the code:

when a cluster is freed (by qcow2_alloc_cluster_link_l2),
when a cluster is overwritten,
when a cluster has reached half of the max refcount.

The half-max-refcount case already knows the hash value, so the by-sector
lookup can be avoided.

The cluster-freed case could read the cluster back and recompute the hash,
solving the issue at the cost of an extra 4KB read.
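
A sketch of that fallback, with stub helpers standing in for the prototype's
read and hash-store code (names are mine):

/* "Recompute on free": when a cluster is about to be freed, read it
 * back, rehash it, and drop that hash from the store, so no
 * cluster-indexed structure is needed. */
#include <glib.h>
#include <string.h>

#define CLUSTER_SIZE 4096
#define HASH_SIZE    32

/* Stub: the real code would read the physical cluster from the image. */
static int read_physical_cluster(guint64 cluster, guint8 *buf)
{
    memset(buf, 0, CLUSTER_SIZE);
    (void)cluster;
    return 0;
}

/* Stub: the real code would remove the hash from the on-disk store. */
static void dedup_delete_hash(const guint8 *hash)
{
    g_print("deleting hash starting with %02x\n", hash[0]);
}

static int dedup_forget_freed_cluster(guint64 cluster)
{
    guint8 buf[CLUSTER_SIZE];
    guint8 hash[HASH_SIZE];
    gsize len = HASH_SIZE;
    GChecksum *cs;

    /* The extra cost of this path is exactly one 4KB read. */
    if (read_physical_cluster(cluster, buf) < 0) {
        return -1;
    }

    cs = g_checksum_new(G_CHECKSUM_SHA256);
    g_checksum_update(cs, buf, CLUSTER_SIZE);
    g_checksum_get_digest(cs, hash, &len);
    g_checksum_free(cs);

    dedup_delete_hash(hash);
    return 0;
}

int main(void)
{
    return dedup_forget_freed_cluster(7);
}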

The overwritten case can probably be solved if overwriting writes become
allocating writes. This way the existing cluster is left in place and will be
freed later.

Would it be acceptable for overwriting writes to be changed to allocating
writes in the deduplication case?

Best regards

Benoît


