Re: [Qemu-devel] Re: Strategic decision: COW format
From: Chunqiang Tang
Subject: Re: [Qemu-devel] Re: Strategic decision: COW format
Date: Mon, 14 Mar 2011 17:32:15 -0400
> > > FVD's novel use of the reference count table reduces the metadata
> > > update overhead to literally zero during normal execution of a VM.
> > > This gets the best of QCOW2's reference count table but without its
> > > overhead. In FVD, the reference count table is only updated when
> > > creating a new snapshot or deleting an existing snapshot. The
> > > reference count table is never updated during normal execution of a VM.
> >
> > Do you want to send out a break-down of the steps (and cost) involved
> > in doing:
> >
> > 1. Snapshot creation.
> > 2. Snapshot deletion.
> > 3. Opening an image with n snapshots.
> Here is a detailed description. Relevant to the discussion of snapshots,
> FVD uses a one-level lookup table and a refcount table. FVD's one-level
> lookup table is very similar to QCOW2's two-level lookup table, except
> that it is much smaller in FVD, and is preallocated and hence contiguous
> in the image.
>
> FVD's refcount table is almost identical to that of QCOW2, but with a
> key difference. An image consists of an arbitrary number of read-only
> snapshots and a single writeable image front, which is the current image
> state perceived by the VM. Below, I will simply refer to the read-only
> snapshots as "snapshots," and to the "writeable image front" as
> "writeable-front." QCOW2's refcount table counts clusters that are used
> by either read-only snapshots or writeable-front. Because writeable-front
> changes as the VM runs, QCOW2 needs to update the refcount table on the
> fast path of normal VM execution.
>
> By contrast, FVD's refcount table only counts chunks that are used by
> read-only snapshots, and does not count chunks used by writeable-front.
> This is the key that allows FVD to entirely avoid updating the refcount
> table on the fast path of normal VM execution.
>
> Below are the detailed steps for different operations.
>
> Operation 1: Open an image with n snapshots.
>
> Let me introduce some basic concepts first. The storage allocation unit
> in FVD is called a chunk (like a cluster in QCOW2). The default chunk
> size is 1MB, like that in VDI (VMDK and Microsoft VHD use 2MB chunks).
> An FVD image file is conceptually divided into chunks, where chunk 0 is
> the first 1MB of the image file, chunk 1 is the second 1MB, ..., chunk
> j, ..., and so forth. The size of an image file grows as needed, just
> like that of QCOW2. The refcount table is a linear array "uint16_t
> refcount[]". If a chunk j is referenced by s different snapshots, then
> refcount[j] = s. If a new snapshot is created and this new snapshot also
> uses chunk j, then refcount[j] is incremented to s+1.
>
> If all snapshots together use 1TB of storage space, there are
> 1TB/1MB = 1,000,000 chunks, and the size of the refcount table is 2MB.
> Loading the entire 2MB refcount table from disk into memory takes about
> 15 milliseconds. If the virtual disk size perceived by the VM is also
> 1TB, FVD's one-level lookup table is 4MB. It serves the same purpose as
> QCOW2's two-level lookup table, but is much smaller and is preallocated
> and hence contiguous in the image. Loading the entire 4MB lookup table
> from disk into memory takes about 20 milliseconds. These numbers mean
> that it is quite affordable to scan both tables in their entirety at VM
> boot time, although the scan can also be avoided in FVD. The
> optimizations will be described later.
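As a quick sanity check of these sizes, here is an illustrative calculation of my own (the variable names are mine, not FVD code):

```python
# Illustrative sanity check of the metadata sizes for a 1TB virtual disk
# with FVD's default 1MB chunks. Names are mine, not FVD's actual code.
CHUNK_SIZE = 1 << 20                  # 1 MiB per chunk
DISK_SIZE = 1 << 40                   # 1 TiB virtual disk

num_chunks = DISK_SIZE // CHUNK_SIZE  # 1,048,576 chunks
refcount_bytes = num_chunks * 2       # one uint16_t refcount[] entry per chunk
lookup_bytes = num_chunks * 4         # one 32-bit lookup entry per chunk
bitmap_bytes = num_chunks // 8        # one bit per chunk in the bitmap

print(refcount_bytes >> 20)           # 2   (the 2MB refcount table)
print(lookup_bytes >> 20)             # 4   (the 4MB lookup table)
print(bitmap_bytes >> 10)             # 128 (KiB; the "125KB" in the text
                                      # comes from the decimal 1,000,000 figure)
```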
>
> When opening an image with n snapshots, an unoptimized version of FVD
> performs the following steps:
>
> O1: Load the entire 2MB reference count table from disk into memory.
> This step takes about 15ms.
>
> O2: Load the entire 4MB lookup table from disk into memory. This step
> takes about 20ms.
>
> O3: Use the two tables to build an in-memory data structure called the
> "free-chunk-bitmap." This step takes about 2ms. The free-chunk-bitmap
> identifies free chunks that are not used by either snapshots or
> writeable-front, and hence can be allocated for future writes. The size
> of the free-chunk-bitmap is only 125KB for a 1TB disk, and hence the
> memory overhead is negligible. The free-chunk-bitmap also supports trim
> operations. It does not have to be persisted on disk, as it can always
> be rebuilt easily, although as an optimization it can be persisted on
> disk at VM shutdown.
>
> O4: Compare the refcount table and the lookup table to identify chunks
> that are in both tables (i.e., shared), so that the running VM's writes
> to those chunks in writeable-front trigger copy-on-write. This step
> takes about 2ms. One bit in each lookup table entry is stolen to mark
> whether a chunk in writeable-front is shared with snapshots and hence
> needs copy-on-write upon a write.
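Steps O3 and O4 can be sketched as follows. This is a minimal illustration of my own, assuming a simplified representation in which a lookup entry is either None (unallocated) or a chunk number, and COW_BIT stands in for the stolen marker bit; none of these names come from FVD's actual code.

```python
# Illustrative sketch of steps O3 (build the free-chunk-bitmap) and O4
# (mark shared chunks for copy-on-write). Names are mine, not FVD's.
COW_BIT = 1 << 31          # assumed stolen bit in a lookup entry
CHUNK_MASK = COW_BIT - 1   # mask to recover the chunk number

def open_tables(refcount, lookup, num_chunks):
    # O3: a chunk is free iff no snapshot references it (refcount == 0)
    # and writeable-front does not map it either.
    free = [True] * num_chunks
    for j in range(num_chunks):
        if refcount[j] > 0:
            free[j] = False
    for entry in lookup:
        if entry is not None:
            free[entry & CHUNK_MASK] = False
    # O4: a chunk mapped by writeable-front that is also referenced by a
    # snapshot is shared; steal one bit to flag it for copy-on-write.
    for i, entry in enumerate(lookup):
        if entry is not None and refcount[entry & CHUNK_MASK] > 0:
            lookup[i] = entry | COW_BIT
    return free
```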
>
> The whole process above, i.e., opening an image with n (e.g., n=1000)
> snapshots, takes about 39ms, and it is a one-time cost at VM boot.
> Later, I will describe optimizations that can further reduce this 39ms
> by saving the 125KB free-chunk-bitmap to disk at VM shutdown, but that
> optimization is more than likely an over-engineering effort, given that
> 39ms at VM boot is perhaps not a big deal.
>
> Once the image is opened, the refcount table is discarded and never
> updated during normal execution of the VM. This is how FVD gets the best
> of QCOW2's refcount table but without its runtime overhead. When the VM
> issues a write to a chunk in writeable-front and this chunk is shared
> with snapshots (which is known by looking at the special marker bit in
> lookup table entries), FVD uses the free-chunk-bitmap to find a free
> chunk, performs copy-on-write, and updates the lookup table to point to
> that new chunk. This metadata change to the lookup table is persisted in
> FVD's journal and is not written directly to the writeable-front's
> on-disk lookup table. When the VM shuts down gracefully, the entire
> lookup table is written back to the writeable-front's on-disk lookup
> table (this step takes about 20ms), and FVD's image header is updated to
> indicate that there is no need to recover from the journal on the next
> VM boot.
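The write path just described can be sketched as follows. This is an illustration of my own, assuming whole-chunk writes, a "shared" flag list standing in for the stolen marker bit, and a journal modeled as an in-memory list; the names are mine, not FVD's API.

```python
# Illustrative sketch of FVD's copy-on-write write path. "shared" models
# the stolen marker bit, "journal" models FVD's on-disk journal, and
# "chunks" models the image file's chunks. Names are mine, not FVD's.
def write_chunk(idx, data, lookup, shared, free, chunks, journal):
    entry = lookup[idx]
    if entry is None or shared[idx]:
        # Unallocated, or shared with a snapshot: allocate a free chunk
        # from the free-chunk-bitmap.
        new = free.index(True)
        free[new] = False
        if shared[idx]:
            # Copy-on-write; a real implementation would copy only the
            # portion of the chunk not being overwritten.
            chunks[new] = chunks[entry]
            shared[idx] = False
        lookup[idx] = new
        # The mapping change is persisted in the journal only; the
        # on-disk lookup table is rewritten on clean shutdown.
        journal.append((idx, new))
        entry = new
    chunks[entry] = data
```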
>
> Now let me describe an optimization that can further reduce the 39ms
> needed for opening an image with n snapshots. When the VM shuts down
> gracefully, FVD can save the in-memory 125KB free-chunk-bitmap to disk.
> When the VM reboots, FVD can load the 125KB free-chunk-bitmap from disk
> and skip steps O1, O2, and O3. Since the lookup table is saved to disk
> on a clean shutdown and one bit in each table entry marks whether a
> chunk is shared with snapshots, the scanning step in O4 can also be
> avoided. In other words, all of steps O1-O4 can be avoided after a clean
> shutdown. They are needed only after a host crash. During the recovery
> process after a host crash, the journal is used to rebuild the lookup
> table, and the refcount table and the lookup table together are used to
> rebuild the free-chunk-bitmap.
>
> This is a description of FVD's image open operation.
>
> (…to be continued… I am running out of time now, and will write about
> FVD's other snapshot operations in separate emails.)
Let me continue my writeup to describe other snapshot-related operations
in FVD.
Operation 2: Snapshot creation.
FVD performs the following operations when creating a new snapshot.
C1: Create a new snapshot by saving a copy of FVD's bitmap and lookup
table to a new location on disk. FVD's lookup table for a 1TB disk is 4MB
and the bitmap is even smaller. This step takes about 30ms.
C2: Load the 2MB refcount table from disk to memory. This step takes about
15ms.
C3: Use the writeable-front's 4MB lookup table to update the in-memory 2MB
refcount table. For a chunk j that exists in the lookup table, increment
the refcount table by doing refcount[j]++. This step is an in-memory
operation and takes about 2ms.
C4: Write the new 2MB refcount table to disk. This step takes about 15ms.
C5: Write the new list of snapshots to disk. This step takes about 10ms.
The snapshot creation process above takes about 72ms. Recall that on the
fast path during the VM's normal execution, FVD never updates the on-disk
refcount table and does not even keep the refcount table in memory.
Conceptually, updating the refcount table is shifted from the fast path of
VM execution to steps C2, C3, and C4 of the snapshot creation process.
From another perspective, this also allows FVD to batch all updates to the
refcount table and perform them in a single, efficient, sequential write
at the time of snapshot creation. This is the trick that makes internal
snapshots using the refcount table so efficient in FVD.
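The batched update in step C3 can be sketched as follows (an illustration of my own, with a lookup entry being either None or a chunk number; the function name is mine, not FVD's):

```python
# Illustrative sketch of step C3: when a snapshot is created, every chunk
# currently mapped by writeable-front gains one reference from the new
# read-only snapshot. The whole table is then flushed in one sequential
# write (step C4, not shown here). Names are mine, not FVD's.
def snapshot_refcount_update(refcount, lookup):
    for entry in lookup:
        if entry is not None:
            refcount[entry] += 1
    return refcount
```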
Operation 3: Snapshot deletion.
FVD performs the following operations when deleting a snapshot. Suppose
snapshot X is to be deleted.
D1: Load the 2MB refcount table from disk to memory. This step takes about
15ms.
D2: Load snapshot X's 4MB lookup table from disk. This step takes about
20ms.
D3: Use snapshot X's 4MB lookup table to update the in-memory 2MB refcount
table. For a chunk j that exists in snapshot X's lookup table, decrement
the refcount table by doing refcount[j]--. This step is an in-memory
operation and takes about 2ms.
D4: Write the new 2MB refcount table to disk. This step takes about 15ms.
D5: Write the new list of snapshots to disk. This step takes about 10ms.
The snapshot deletion process above takes about 62ms.
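Step D3 is the mirror image of snapshot creation (again an illustration of my own, with names that are mine, not FVD's): each chunk mapped by the deleted snapshot loses one reference, and a count that drops to zero makes the chunk reclaimable when the free-chunk-bitmap is next rebuilt.

```python
# Illustrative sketch of step D3: deleting snapshot X drops one reference
# on every chunk that X's lookup table maps. A refcount that falls to
# zero (and is not mapped by writeable-front) becomes free space.
def delete_refcount_update(refcount, snap_lookup):
    for entry in snap_lookup:
        if entry is not None:
            refcount[entry] -= 1
    return refcount
```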
Operation 4: Go to a snapshot, i.e., derive a new writeable-front based on
a snapshot.
FVD performs the following operations when going to a snapshot X.
G1: Load snapshot X's bitmap and lookup table from disk into memory, and
save a new copy on disk, which will be the on-disk metadata for the new
writeable-front. FVD's lookup table for a 1TB disk is 4MB and the bitmap
is even smaller. This step takes about 60ms.
G2: Update the FVD image's header to point to the new bitmap and the new
lookup table. This step takes about 7ms.
G3: Follow the image open operation, which takes about 39ms in total.
The goto snapshot process above takes about 106ms. Note that when going to
a snapshot, FVD need not update the refcount table, because the refcount
table only counts read-only snapshots, which do not change during a goto
snapshot operation.
In summary, snapshot creation, snapshot deletion, and snapshot goto are
all fast operations in FVD, taking 106ms or less. Most importantly, on the
fast path during the normal execution of the VM, FVD never updates the
on-disk refcount table and does not even keep the refcount table in
memory. The only overhead in FVD is the 125KB free-chunk-bitmap that must
be kept in memory, but the free-chunk-bitmap is already needed to support
trim and to recover leaked storage space after a host crash, even if we
do not introduce the snapshot feature into FVD. In other words, FVD gets
all the benefits of QCOW2's internal snapshot function without paying any
of its overhead. I truly believe this is the ideal solution for
snapshots, going beyond the current state of the art in QCOW2 and VMware.
Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang