Re: [Qemu-devel] QCOW2 deduplication


From: Benoît Canet
Subject: Re: [Qemu-devel] QCOW2 deduplication
Date: Wed, 27 Feb 2013 16:58:03 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

> > The current prototype of the QCOW2 deduplication uses 32-byte SHA256 or
> > SKEIN hashes to identify each 4KB cluster with a very low probability of
> > collision.
> 
> How do you handle the rare collision cases? Do you read the original
> cluster and compare the exact contents when the hashes match?

Stefan found a paper with the math required to compute the collision
probability: http://plan9.bell-labs.com/sys/doc/venti/venti.html
             (Section 3.1)
Doing the math for 1 exabyte of stored data with 4KB clusters and 256-bit
hashes gives a collision probability of 2.57E-49.
Since the probability is that low, I plan to code the read/compare check as an
option that users can toggle.
The people who wrote the deduplication in ZFS did it the same way.
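
Spelling out the math from Section 3.1 of the Venti paper, which bounds the
probability of at least one collision among n blocks hashed to b bits with a
birthday-style estimate:

    p <= n(n-1)/2 * 1/2^b

    n = 10^18 bytes / 4096 bytes per cluster ~= 2.44e14 hashes
    p ~= (2.44e14)^2 / 2 / 2^256 ~= 2.98e28 / 1.16e77 ~= 2.57e-49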

> 
> > The 4KB cluster size was chosen because it's the block size of most of the
> > filesystems that will run on guests.
> > The current prototype shows that very high deduplication ratios are doable
> > with 4KB clusters.
> 
> Is this the qcow2 cluster size or is it independent from it? Is 4k a
> must or is it just what you think is a good default?

It's the qcow2 cluster size.
It's a must because most guest filesystems use 4KB blocks.
Using a 4KB cluster size avoids being forced to do extra data reads to complete
unaligned/partial writes from the guest.
(Imagine a guest doing 4KB writes while the dedup needs 64KB of data to compute
a single hash.)
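
To put a number on that example: with 64KB dedup clusters, a 4KB guest write
would force the dedup code to read the other 60KB of the cluster before it
could compute a single hash, i.e. a 15x read amplification before the actual
4KB write is even issued. With 4KB clusters no such read is needed.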

> 
> > The SKEIN implementation I am linking to goes at 400MB/s on a modern machine
> > so it's not the bottleneck.
> > 
> > The deduplication code does not modify the read path.
> > 
> > The deduplication code is hooked in qcow2_co_writev in three places.
> > 
> > qcow2_dedup_read_missing_and_concatenate:
> > At this first hook, at most two reads are issued to build a buffer whose
> > size is a multiple of the cluster size.
> > These two reads will be at the beginning and at the end of the buffer.
> > In practice guest filesystems use 4KB blocks so no unaligned write is done
> > and these two reads are skipped.
> 
> If you do read the cluster, how do you protect against concurrent write
> requests to the same cluster? Just hold s->lock or something finer grained?

I have s->dedup_lock, which is taken at the beginning of qcow2_co_writev in the
deduplication case in order to serialize writes and avoid the concurrent write
case.

An alternative would be to use something like wait_for_overlapping_request in
the block layer to deal with this, though I don't know whether that would be a
better solution.
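
For clarity, here is a minimal sketch of the serialization, assuming a CoMutex
field named dedup_lock and a has_dedup flag in BDRVQcowState (dedup_lock is the
lock mentioned above; everything else is illustrative, not the prototype code):

    /* Sketch only: s->dedup_lock serializes deduplicated writes. */
    static coroutine_fn int qcow2_co_writev(BlockDriverState *bs,
                                            int64_t sector_num,
                                            int remaining_sectors,
                                            QEMUIOVector *qiov)
    {
        BDRVQcowState *s = bs->opaque;
        int ret;

        if (s->has_dedup) {
            /* Make sure two requests cannot hash, look up and deduplicate
             * the same cluster concurrently. */
            qemu_co_mutex_lock(&s->dedup_lock);
        }

        /* ... hash the buffer, try to deduplicate, then do the usual
         * cluster allocation and data write, setting ret ... */
        ret = 0;

        if (s->has_dedup) {
            qemu_co_mutex_unlock(&s->dedup_lock);
        }
        return ret;
    }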

> 
> > qcow2_dedup:
> > As we discussed on IRC a few months ago this function tries to deduplicate
> > as many consecutive clusters as it can.
> > It will then skip all the non-deduplicable clusters and locate the next
> > cluster that can be deduplicated.
> > This way qcow2_co_writev knows exactly how much data it has to write.
> > 
> > qcow2_dedup_store_new_hashes:
> > This function is to be called after the actual new data is written to disk
> > in order to make the hashes persistent.
> 
> What happens if the host crashes between writing the data and updating
> the hashes?

There are three cases: it's a plain write, the dedup kicks in, or
it's a rewrite of a physical block.

If it's a plain write it's fine: this cluster's content will simply not be
deduplicated the next time the dedup sees it.

If it triggers the dedup it will just increase the refcount and write the L2
entry.

If it's a rewrite it's more annoying, and I think that doing allocating writes
in the rewrite case solves this because the hash doesn't need to be updated
anymore.

> 
> > When qcow2_alloc_cluster_link_l2 is freeing a block its hash is deleted
> > from disk.
> 
> qcow2_alloc_cluster_link_l2() frees clusters only if they had a
> refcount > 1 and COW was needed. The cluster still has a refcount > 0
> after one reference was dropped, so there's no reason to delete the
> hash.
Yes.

> 
> > When the refcount of a block exceeds 2^16/2 the hash of this block is
> > tagged and forgotten so there will be room to take qcow2 snapshots.
> > 
> > Most of the code is done in qcow2-dedup.c
> > 
> > The current implementation stores the hashes in RAM in two GLib GTrees,
> > the first one indexed by hash and the second one indexed by cluster index.
> > 
> > On disk the hashes are stored in a sparse L1/L2 table indexed by physical
> > cluster sectors.
> > 
> > This organization works well for images of a few GB but as Stefan says it
> > will not be sufficient for TB-sized images because the GTrees eat too much
> > RAM.
> 
> Yes, we can't keep everything in RAM for large images.
> 
> What is the physical offset -> hash mapping actually used for?

Cluster deletion, half-refcount overflow and physical cluster rewriting.
The last case is the most difficult to get rid of.
It occurs when a rewrite changes the content of a physical cluster and makes a
hash obsolete.
The code then needs to get rid of this hash given only the physical index of
the cluster.
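
As a concrete illustration of the two in-RAM structures mentioned above (the
names and fields here are illustrative, not the actual prototype code):

    #include <glib.h>
    #include <stdint.h>
    #include <string.h>

    #define HASH_LEN 32                      /* SHA256/SKEIN digest size */

    typedef struct DedupHashNode {
        uint8_t  hash[HASH_LEN];             /* content hash of the cluster */
        uint64_t physical_sect;              /* where the cluster lives on disk */
    } DedupHashNode;

    static gint hash_cmp(gconstpointer a, gconstpointer b)
    {
        return memcmp(a, b, HASH_LEN);
    }

    static gint sect_cmp(gconstpointer a, gconstpointer b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x < y) ? -1 : (x > y);
    }

    /* hash -> node: used on the write path to detect duplicates */
    static GTree *dedup_by_hash;
    /* physical sector -> node: used for deletion, overwrite and
     * half-refcount overflow */
    static GTree *dedup_by_sect;

    static void dedup_trees_init(void)
    {
        dedup_by_hash = g_tree_new(hash_cmp);
        dedup_by_sect = g_tree_new(sect_cmp);
    }

    static void dedup_trees_insert(DedupHashNode *node)
    {
        g_tree_insert(dedup_by_hash, node->hash, node);
        g_tree_insert(dedup_by_sect, &node->physical_sect, node);
    }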

> 
> > Alternative disk data structures
> > --------------------------------
> > 
> > I studied a bit the classic data structures that could be used and even
> > prototyped one to see how it behaves.
> > 
> > The first class of data structures that could be used to store the hashes
> > is the B-tree family.
> > It's well described in textbooks but needs up to O(log(n)) 4KB reads to
> > look up a single hash.
> > The implementation seems a bit complex and writing a new hash would also
> > need O(log(n)) IOs.
> > It would make the deduplication perform badly.
> 
> n being the number of allocated clusters in the image, right? And log
> the logarithm with base m, with m being the number of elements in one
> tree node?
> 
> This doesn't really sound _that_ bad. In practice log(n) would be like 3
> or 4 then. The first two levels could probably be in memory without
> problems.
I agree on the top-level caching.

The problem is elsewhere.

As we really need 4KB clusters and we can't do any read or write smaller than
4KB, we need to keep the data IO/metadata IO ratio higher than 1.

Consider that a single 4KB data write would generate a few 4KB lookup reads
plus a 4KB metadata write, and in the worst case a leaf split and O(log(n))
node-splitting writes toward the top of the tree.
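
To put rough numbers on it (assuming 40-byte entries, i.e. a 32-byte hash plus
an 8-byte offset):

    entries per 4KB node:    4096 / 40      ~= 100
    hashes for a 1TB image:  1TB / 4KB      ~= 2.7e8
    tree depth:              log_100(2.7e8) ~= 4

So even with the top two levels cached, each incoming 4KB guest write would
cost around two uncached 4KB metadata reads plus at least one 4KB metadata
write, which already pushes the data IO/metadata IO ratio below 1 before any
node split happens.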

> 
> The problem that I really see is that with hashes you don't really have
> any locality that caches for all levels could use, so you don't get
> around at least one disk read for each request.
> 
> > The second class of data structures that could be used are hash tables.
> > The promise is O(1) lookup and insertion.
> > The first shortcoming is that they don't grow well.
> > Extendible hashing (http://goo.gl/GiB7A) tries to address the growth issue.
> > The second shortcoming is that every new 4KB block stored on disk would
> > generate a new hash that would need 1 random 4KB read + 1 random 4KB write
> > to be stored.
> > With SSD storage being more and more used and not being good at random
> > writes, it makes this solution less attractive.
> 
> O(1)? What about collisions?
Extendible hashing is made of two parts: a directory indexed by a bit prefix
or suffix of the key, and large buckets absorbing collisions.

A full bucket triggers a restructuring of the directory and the splitting of
that bucket.

It seems possible to spread the restructuring out in an incremental way.

So collisions are not really a problem.

The problem is that in order to keep the directory small we would need to use
64KB buckets, which would invert the data IO/metadata IO ratio.
I prototyped extendible hashing and found it would slow down writes.
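
For reference, the directory lookup itself is simple; here is a sketch
(illustrative code, not the prototype) where the top global_depth bits of the
key select a directory slot pointing to a bucket:

    #include <stdint.h>

    typedef struct Bucket Bucket;

    typedef struct Directory {
        unsigned global_depth;       /* directory has 2^global_depth slots */
        Bucket **slots;
    } Directory;

    static Bucket *dir_lookup(Directory *dir, const uint8_t *hash)
    {
        /* Use the first 64 bits of the 256-bit content hash as the key. */
        uint64_t prefix = 0;
        for (int i = 0; i < 8; i++) {
            prefix = (prefix << 8) | hash[i];
        }
        uint64_t slot = dir->global_depth
                      ? prefix >> (64 - dir->global_depth)
                      : 0;
        return dir->slots[slot];
    }

The cost problem described above is about the buckets rather than the
directory: keeping the directory small means large (e.g. 64KB) buckets, and
then each stored hash costs far more metadata IO than the 4KB of data it
describes.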

> 
> > I finally found this patent-free paper: SILT (http://goo.gl/AG493).
> > 
> > In a first step it stores new hashes in a log, making hash writes
> > sequential and fast at the cost of some RAM.
> > In a second step it reorganizes the last log into a hash table consuming
> > 2 bytes of RAM per hash.
> > Then in a third step it merges the second-step hash tables into a very
> > efficient store.
> > 
> > The full SILT implementation available on GitHub is 15k SLOC of C++.
> > I think that most of the complexity comes from the third step.
> > 
> > I am proposing to build the QCOW2 deduplication hash store on the first two
> > steps of SILT to get most of the benefits without the complexity.
> 
> This makes me wonder if we should look into more general journaling for
> all parts of qcow2.
> 
> > The RAM requirement would be approx 4 bytes per 4KB of data stored on disk.
>
> What exactly are these 4k? Keep in mind that we could use data
> structures that occupy more than one cluster if it helps - like QED used
> four clusters for each L2 table by default.
These 4KB are a qcow2 cluster of actual data written by the guest.
With the deduplication each 4KB cluster needs its own 32-byte hash + two
uint64_t offsets to be stored and indexed.
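
To make the sizes concrete, here is a sketch of what one such record could
look like (the field meanings are illustrative; the description above only
fixes the 32-byte hash and the two uint64_t offsets):

    #include <stdint.h>

    #define HASH_LEN 32

    /* One record per 4KB cluster of guest data: 48 bytes of on-disk
     * metadata per 4KB of data, i.e. about 1.2% overhead. */
    typedef struct DedupRecord {
        uint8_t  hash[HASH_LEN];       /* SHA256/SKEIN digest of the cluster */
        uint64_t physical_offset;      /* where the data lives in the image  */
        uint64_t logical_offset;       /* illustrative use of the 2nd offset */
    } DedupRecord;

Only a small per-record filter entry has to stay in RAM, which is presumably
where the ~4 bytes per 4KB figure above comes from.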

> 
> > Each of the second-step hash tables would have a small hash filter stored
> > in RAM.
> > 
> > These hash filters would be interleaved in order to make their traversal
> > cache-friendly.
> > 
> > This requires linearly scanning the hash filters at RAM speed in order to
> > know which hash table must be queried.
> > 
> > The cost to look up a hash would be almost a single random read (O(1)).
> > The cost to write a hash is amortized as it's written to a log.
> > 
> > To keep the risk of false positives low as the hash filter list grows,
> > I would use 4-byte hash filter entries instead of 2-byte ones.
> > This would use 1GB of RAM for 1TB worth of stored data.
> > 
> > As the first two steps of SILT are the best data structures I found while
> > digging into this topic, could you give your feelings about it?
> 
> I haven't really looked at it yet, but as you describe it, the log seems
> to be the crucial element. I think it can be combined with every on-disk
> data structure, in fact. So if what they use fits well, sure, use it,
> but if it doesn't, we can still get the write cost down this way.

Yes, the log totally gets rid of the random writes and keeps the data
IO/metadata IO ratio fine when the guest writes new data.

The only real extra cost would be a 4KB read in one of the hash tables.
A smaller extra cost would be the linear scan of the hash table filters stored
in RAM.
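
A sketch of that in-RAM filter scan (illustrative code, not the prototype):
each stored hash contributes one 4-byte tag, tags are grouped per on-disk hash
table, and a linear scan tells us which table, if any, is worth reading:

    #include <stdint.h>
    #include <string.h>

    typedef struct TableFilter {
        uint32_t  table_id;      /* which second-step table this covers */
        uint32_t  nb_tags;
        uint32_t *tags;          /* 4-byte prefix of each hash in that table */
    } TableFilter;

    /* Returns the id of a table that may contain the hash, or -1.
     * A 4-byte tag can produce false positives, so the caller still has
     * to read the table and compare the full 32-byte hash. */
    static int64_t filter_scan(const TableFilter *filters, int nb_filters,
                               const uint8_t *hash)
    {
        uint32_t tag;
        memcpy(&tag, hash, sizeof(tag));   /* first 4 bytes of the digest */

        for (int f = 0; f < nb_filters; f++) {
            for (uint32_t i = 0; i < filters[f].nb_tags; i++) {
                if (filters[f].tags[i] == tag) {
                    return filters[f].table_id;
                }
            }
        }
        return -1;
    }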

> 
> > Removal of the second lookup data structure
> > -------------------------------------------
> > 
> > Needing two data structures on disk for deduplication is annoying because
> > they need to be synchronized and it may require a third structure (a
> > journal).
> > 
> > The second data structure is used to look up a hash by its physical
> > cluster sector.
> > 
> > It is used in a few places in the code:
> > 
> > when a cluster is freed (by qcow2_alloc_cluster_link_l2),
> > when a cluster is overwritten,
> > and when a cluster has reached half of the max refcount.
> > 
> > The half-max-refcount case already knows the hash value so the by-sector
> > lookup can be avoided.
> > 
> > The cluster-freeing case could read the cluster and recompute the hash,
> > solving the issue at the cost of an extra 4KB read.
> > 
> > The overwrite case can probably be solved if overwriting writes become
> > allocating writes. This way the existing cluster is left in place and will
> > be freed later.
> > 
> > Is it thinkable that overwriting writes be changed to allocating writes in
> > the deduplication case?
> 
> Allocating writes come with extra overhead and complexity. I'm not sure
> if it's really advisable to do this. But if you think it's the best way
> to solve the problem, I'm not opposed to it.

OK, I will test whether this works at all by patching the current code.

Best regards

Benoît
> 
> Kevin
> 


