Re: [Qemu-devel] Re: Strategic decision: COW format
From: Chunqiang Tang
Subject: Re: [Qemu-devel] Re: Strategic decision: COW format
Date: Tue, 22 Feb 2011 22:32:38 -0500
> In any case, the next step is to get down to specifics. Here is the
> page with the current QCOW3 roadmap:
>
> http://wiki.qemu.org/Qcow3_Roadmap
>
> Please raise concrete requirements or features so they can be
> discussed and captured.
This is turning into a more productive discussion, but it seems to lose the
big picture too quickly and to go too narrowly into issues like the
“dirty bit”. Let’s try to answer a bigger question: how do we take a holistic
approach to addressing all the factors that make a virtual disk slower than a
physical disk? Even if issues like the “dirty bit” are addressed
perfectly, they may still be only a small part of the total solution. The
discussion of internal snapshots is at the end of this email.
Compared with a physical disk, a virtual disk (even RAW) incurs some or
all of the following overheads. Obviously, the way to achieve high
performance is to eliminate or reduce these overheads.
Overhead at the image level:
I1: Data fragmentation caused by an image format.
I2: Overhead in reading an image format’s metadata from disk.
I3: Overhead in writing an image format’s metadata to disk.
I4: Inefficiency and complexity in the block driver implementation, e.g.,
waiting synchronously for reading or writing metadata, submitting I/O
requests sequentially when they should be done concurrently, performing a
flush unnecessarily, etc.
Overhead at the host file system level:
H1: Data fragmentation caused by a host file system.
H2: Overhead in reading a host file system’s metadata.
H3: Overhead in writing a host file system’s metadata.
By design, existing image formats do not address many of these issues,
which is why FVD was invented (http://wiki.qemu.org/Features/FVD).
Let’s look at these issues one by one.
Regarding I1: Data fragmentation caused by an image format:
This problem exists in most image formats (including QCOW2, QED, VMDK, VDI,
VHD, etc.), because they insist on doing storage allocation a second time at
the image level, even though the host file system already does storage
allocation. These image formats unnecessarily mix the function of storage
allocation with the function of copy-on-write, i.e., they determine
whether a cluster is dirty by checking whether it has storage space
allocated at the image level. This is wrong. Storage allocation and
tracking dirty clusters are two separate functions. Data fragmentation at
the image level can be avoided entirely by using a RAW image plus a bitmap
header that indicates whether clusters are dirty due to copy-on-write. FVD
can be configured to take this approach, although it can also be
configured to do storage allocation. Doing storage allocation at the
image level can be optional, but should never be mandatory.
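
To illustrate the separation of the two functions, here is a minimal sketch
in C of a read path that keeps a RAW data layout and uses only a bitmap for
copy-on-write. The names, the 64KB granularity, and the helpers are
illustrative assumptions, not FVD's actual code:

    /* Minimal sketch, not FVD's actual code: copy-on-write tracked by a
     * bitmap over a RAW data layout, decoupled from storage allocation. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <unistd.h>

    #define CLUSTER_SIZE (64 * 1024)  /* copy-on-write granularity */

    typedef struct {
        uint8_t *cow_bitmap; /* one bit per cluster: 1 = data is in this image */
        int base_fd;         /* base (backing) image */
        int data_fd;         /* this image's data area, laid out like RAW */
    } CowImage;

    static bool cluster_in_image(const CowImage *img, uint64_t cluster)
    {
        return img->cow_bitmap[cluster / 8] & (1u << (cluster % 8));
    }

    /* Read one cluster: the guest offset is also the image offset (no
     * image-level allocation), and the bitmap alone decides the source. */
    static ssize_t read_cluster(CowImage *img, uint64_t cluster, void *buf)
    {
        off_t offset = (off_t)cluster * CLUSTER_SIZE;
        int fd = cluster_in_image(img, cluster) ? img->data_fd : img->base_fd;
        return pread(fd, buf, CLUSTER_SIZE, offset);
    }

A write to a not-yet-copied cluster would store the data in data_fd at the
same offset and set the bit, so data never moves away from its RAW location.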
Regarding I2: Overhead in reading an image format’s metadata from disk:
Obviously, the solution is to make the metadata small so that it can be
cached entirely in memory. In this respect, QCOW1/QCOW2/QED and the VMDK
workstation variant get it wrong, while VirtualBox VDI, Microsoft VHD, and
the VMDK ESX Server variant get it right. With QCOW1/QCOW2/QED, for a 1TB
virtual disk, the metadata size is at least 128MB. By contrast, with VDI, for
a 1TB virtual disk, the metadata size is only 4MB. The “wrong” formats all
use a two-level lookup table to do storage allocation at a small
granularity (e.g., 64KB), whereas the “right” formats all use a one-level
lookup table to do storage allocation at a large granularity (1MB or 2MB).
The one-level table is also easier to implement. Note that VMware VMDK
started out wrong in the workstation version and was then corrected in the
ESX Server version, which was a good move. As virtual disks grow bigger, the
storage allocation unit is likely to increase in the future, e.g., to 10MB or
even larger. In existing image formats, one limitation of a large storage
allocation unit is that it forces copy-on-write to be performed on an equally
large cluster (e.g., 10MB in the future), which is undesirable. FVD gets the
best of both worlds. It uses a one-level table to perform storage allocation
at a large granularity, but uses a bitmap to track copy-on-write at a smaller
granularity. For a 1TB virtual disk, this approach needs only 6MB of
metadata, slightly larger than VDI’s 4MB.
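To make these numbers concrete, here is the rough arithmetic for a 1TB
virtual disk (assuming 8-byte entries in the two-level tables, 4-byte
entries in a one-level table, 1MB allocation chunks, and one bitmap bit per
64KB; each format's exact header details change the figures slightly):

    Two-level table, 64KB clusters:  1TB / 64KB = 16M entries x 8 bytes ~= 128MB
    One-level table, 1MB chunks:     1TB / 1MB  =  1M entries x 4 bytes ~=   4MB
    One-level table + COW bitmap:    4MB + (1TB / 64KB = 16M bits = 2MB) ~=   6MB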
Regarding I3: Overhead in writing an image format’s metadata to disk:
This is where the “dirty bit” discussion fits, but FVD goes way beyond
that to reduce metadata updates. When an FVD image is fully optimized
(e.g., the one-level lookup table is disabled and the base image is
reduced to its minimum size), FVD has almost zero overhead in metadata
update and the data layout is just like a RAW image. More specifically,
metadata updates are skipped, delayed, batched, or merged as much as
possible without compromising data integrity. First, even with
cache=writethrough (i.e., O_DSYNC), all metadata updates are sequential
writes to FVD’s journal, which can be merged into a single write by the
host Linux kernel. Second, when cache!=writethrough, metadata updates are
batched and sent to the journal on a flush, under memory pressure, or
periodically, much like the kernel’s page cache. Third, FVD’s table
can be (and preferably is) disabled, in which case it incurs no update
overhead at all. Even if the table is enabled, FVD’s chunk is much larger
than QCOW2/QED’s cluster, and hence needs fewer updates. Finally, although
QCOW2/QED and FVD use the same block/cluster size, FVD can be optimized to
eliminate most bitmap updates with several techniques: A) use resize2fs to
reduce the base image to its minimum size (which is what a cloud provider can
do), so that most writes occur at locations beyond the size of the base image
and need no bitmap update; B) have ‘qemu-img create’ find zero-filled
sectors in a sparse base image and preset the corresponding bits of the
bitmap, which then require no runtime updates; and C) copy-on-read and
prefetching do not update the bitmap, and once prefetching finishes there
is no need at all for FVD to read or write the bitmap. Again, when an
FVD image is fully optimized (i.e., the table is disabled and the base
image is reduced to its minimum size), FVD has almost zero overhead in
metadata updates and the data layout is just like a RAW image.
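
As a rough illustration of the batching idea (a sketch under assumed
structures, not FVD's on-disk journal format), metadata updates can be
queued in memory and written to the journal with a single sequential write:

    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define JOURNAL_BUF_SIZE (256 * 1024)

    typedef struct {
        uint32_t type;      /* e.g., bitmap update or table update */
        uint64_t offset;    /* which metadata entry changed */
        uint64_t value;     /* its new value */
    } JournalRecord;

    typedef struct {
        int     fd;         /* journal region of the image */
        off_t   pos;        /* next free position in the journal */
        size_t  buf_used;
        uint8_t buf[JOURNAL_BUF_SIZE];
    } Journal;

    /* Queue one record; records from many requests share one write. */
    static int journal_add(Journal *j, const JournalRecord *rec)
    {
        if (j->buf_used + sizeof(*rec) > JOURNAL_BUF_SIZE) {
            return -1;  /* caller must flush first (omitted here) */
        }
        memcpy(j->buf + j->buf_used, rec, sizeof(*rec));
        j->buf_used += sizeof(*rec);
        return 0;
    }

    /* One sequential write covers all batched records; called on a guest
     * flush, under memory pressure, or periodically. */
    static int journal_flush(Journal *j)
    {
        if (j->buf_used == 0) {
            return 0;
        }
        if (pwrite(j->fd, j->buf, j->buf_used, j->pos) < 0) {
            return -1;
        }
        j->pos += j->buf_used;
        j->buf_used = 0;
        return fdatasync(j->fd);
    }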
Regarding I4: Inefficiency in the block driver, e.g., synchronous metadata
reads/writes:
Today, FVD is the only fully asynchronous, nonblocking COW driver
implemented for QEMU, and has the best performance. This is partially due
to its simple design. The one-level table is easier to implement than a
two-level table. The journal avoids the sophisticated locking that would
otherwise be required for performing metadata updates. FVD parallelizes
I/Os to the maximum degree possible. For example, if processing a
VM-generated read request needs data from the base image as well as from
several non-contiguous chunks in the FVD image, FVD issues all the I/O
requests in parallel rather than sequentially.
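
The parallel-submission pattern is roughly the following (a sketch only:
aio_submit_read() is a stand-in for whatever asynchronous read primitive the
driver uses, implemented synchronously here just to keep the example
self-contained):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    typedef void CompletionFunc(void *opaque, int ret);

    /* Stand-in for an async read API; a real driver would submit the I/O
     * and invoke the callback later instead of blocking in pread(). */
    static void aio_submit_read(int fd, uint64_t offset, void *buf,
                                size_t len, CompletionFunc *cb, void *opaque)
    {
        ssize_t n = pread(fd, buf, len, (off_t)offset);
        cb(opaque, n < 0 ? -1 : 0);
    }

    typedef struct {
        int pending;            /* pieces still in flight */
        int error;              /* first error seen, if any */
        CompletionFunc *cb;     /* completion for the whole guest request */
        void *cb_opaque;
    } GuestRequest;

    static void piece_done(void *opaque, int ret)
    {
        GuestRequest *req = opaque;
        if (ret < 0 && req->error == 0) {
            req->error = ret;
        }
        if (--req->pending == 0) {
            req->cb(req->cb_opaque, req->error);  /* whole request done */
        }
    }

    /* Issue all pieces up front instead of waiting for each one in turn:
     * some may hit the base image, others non-contiguous chunks. */
    static void submit_pieces(GuestRequest *req, int n, const int *fds,
                              const uint64_t *offsets, void *const *bufs,
                              const size_t *lens)
    {
        req->pending = n;
        for (int i = 0; i < n; i++) {
            aio_submit_read(fds[i], offsets[i], bufs[i], lens[i],
                            piece_done, req);
        }
    }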
Regarding H1&H2&H3: fragmentation and metadata reads/writes caused by the
host file system:
FVD can be optionally configured to get rid of the host file system and
store an image on a logical volume directly. This seems straightforward
but a naïve solution like the one currently used with QCOW2 cannot
achieve storage thin provisioning (i.e., storage over-commit), as the
logical volume would have to be allocated upfront at the full size of the
image. FVD supports thin provisioning on a logical volume, by starting
with a small one and growing it automatically when needed. It is quite
easy for FVD to track the size of used space, without the need to update a
size field in the image header on every storage allocation (which is a
problem in VDI). There are multiple efficient solutions possible in FVD.
One solution is to piggyback the size field as part of the journal entry
that records a new storage allocation. Alternatively, even doing an ‘fsck’-like
scan of FVD’s one-level lookup table to figure out the used space is
trivial. Because the table is only 4MB for a 1TB virtual disk and it is
contiguous in the image, a scan takes only about 20 milliseconds: 15
milliseconds to load 4MB from disk and less than 5 milliseconds to scan
4MB in memory. This is more efficient than a dirty bit in QCOW2 or QED.
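
As an illustration of why the scan is trivial (the entry encoding below is an
assumption, not FVD's actual layout), it amounts to a single pass over the
in-memory table:

    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK_SIZE        (1024 * 1024)  /* 1MB allocation unit */
    #define UNALLOCATED_ENTRY 0xFFFFFFFFu    /* assumed "not allocated" marker */

    /* table: the one-level lookup table already loaded into memory
     * (about 4MB for a 1TB disk); nb_entries: number of 4-byte entries. */
    static uint64_t compute_used_space(const uint32_t *table, size_t nb_entries)
    {
        uint64_t allocated = 0;
        for (size_t i = 0; i < nb_entries; i++) {
            if (table[i] != UNALLOCATED_ENTRY) {
                allocated++;
            }
        }
        return allocated * (uint64_t)CHUNK_SIZE;
    }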
In summary, it seems that people’s imagination for QCOW3 is unfortunately
limited by their overwhelming experience with QCOW2, without even looking at
what VirtualBox VDI, VMware VMDK, and Microsoft VHD have done, not to
mention going beyond all of those to the next level. Regardless of
its name, I hope QCOW3 will take the right actions to fix what is wrong in
QCOW2, including:
A1: abandon the two-level table and adopt a one-level table, as in VDI,
VMDK, and VHD, for simplicity and a much smaller metadata size.
A2: introduce a bitmap to allow copy-on-write without doing storage
allocation, which 1) avoids image-level fragmentation, 2) eliminates the
metadata update overhead of storage allocation, and 3) allows copy-on-write
to be performed on a smaller storage unit (e.g., 64KB) while still keeping
the metadata very small.
A3: introduce a journal to batch and merge metadata updates and to reduce
fsck recovery time after a host crash.
This is exactly the process by which I arrived at the design of FVD. It is
not by chance, but the result of taking a holistic approach to analyzing the
problems of a virtual disk. I think the status of “QCOW3” today is comparable to
FVD’s status 10 months ago when the design started to emerge, but FVD’s
implementation today is very mature. It is the only asynchronous,
nonblocking COW driver implemented for QEMU with undoubtedly the best
performance, both by design and by implementation.
Now let’s talk about features. It seems that there is great interest in
QCOW2’s internal snapshot feature. If we really want that, the right
solution is to follow VMDK’s approach of storing each snapshot as a
separate COW file (see http://www.vmware.com/app/vmdk/?src=vmdk ), rather
than using the reference count table. VMDK’s approach can easily be
implemented for any COW format, or even as a function of the generic block
layer, without complicating any COW format or hurting its performance. I
know such snapshots are not really “internal”, since they are not all stored
in a single file and are instead more like external snapshots, but users
don’t care about that as long as the same use cases are supported. Probably
many people who use VMware don't even know that the snapshots are stored as
separate files. Do they care?
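For what it’s worth, such a chain can already be approximated with backing
files today, e.g., ‘qemu-img create -f qcow2 -b base.img snap1.img’ freezes
base.img as a snapshot and directs new writes to snap1.img; a generic
snapshot feature would mostly need to automate creating, naming, and deleting
such overlay files and recording the chain.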
Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang