Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Tue, 07 Sep 2010 15:41:55 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100713 Lightning/1.0b1 Thunderbird/3.0.6

On 09/07/2010 02:25 PM, Blue Swirl wrote:
On Mon, Sep 6, 2010 at 10:04 AM, Stefan Hajnoczi
<address@hidden>  wrote:
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity.  Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.

Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs.  Workloads
that do not cause new clusters to be allocated will perform similar to
raw images due to in-memory metadata caching.

The format supports sparse disk images.  It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.

Backing files are supported so only deltas against a base image can be
stored.

The file format is extensible so that additional features can be added
later with graceful compatibility handling.

Internal snapshots are not supported.  This eliminates the need for
additional metadata to track copy-on-write clusters.
It would be nice to support external snapshots, so another file
besides the disk images can store the snapshots. Then snapshotting
would be available even with raw or QED disk images. This is of course
not QED specific.

There are two types of snapshots that I think can cause confusion: CPU/device state snapshots and block device snapshots.

qcow2 and qed both support block device snapshots. qed only supports external snapshots (via backing_file) whereas qcow2 supports external and internal snapshots. The internal snapshots are the source of an incredible amount of complexity in the format.

qcow2 can also store CPU/device state snapshots and correlate them to block device snapshots (within a single block device). It only supports doing non-live CPU/device state snapshots.

OTOH, qemu can support live snapshotting via live migration. Today, it can be used to snapshot CPU/device state to a file on the filesystem with minimum downtime.

Combined with an external block snapshot and correlating data, this could be used to implement a single "snapshot" command that would behave like savevm but would not pause a guest's execution.

It's really just a matter of plumbing to expose an interface for this today. We have all of the infrastructure we need.

+ *
+ * +--------+----------+----------+----------+-----+
+ * | header | L1 table | cluster0 | cluster1 | ... |
+ * +--------+----------+----------+----------+-----+
+ *
+ * There is a 2-level pagetable for cluster allocation:
+ *
+ *                     +----------+
+ *                     | L1 table |
+ *                     +----------+
+ *                ,------'  |  '------.
+ *           +----------+   |    +----------+
+ *           | L2 table |  ...   | L2 table |
+ *           +----------+        +----------+
+ *       ,------'  |  '------.
+ *  +----------+   |    +----------+
+ *  |   Data   |  ...   |   Data   |
+ *  +----------+        +----------+
+ *
+ * The L1 table is fixed size and always present.  L2 tables are allocated on
+ * demand.  The L1 table size determines the maximum possible image size; it
+ * can be influenced using the cluster_size and table_size values.
The formula for calculating the maximum size would be nice.

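table_entries = table_size * cluster_size / 8
max_size = table_entries * table_entries * cluster_size

Here table_size is in clusters, cluster_size is in bytes, and 8 is the size of one table entry in bytes.

It's a hell of a lot easier to do powers-of-two math though. With table_size = 2^2 and cluster_size = 2^16:

table_entries = 2^2 * 2^16 / 2^3 = 2^15
max_size = 2^15 * 2^15 * 2^16 = 2^46 = 64TB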
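
For anyone who wants to plug in other values, the two formulas above can be sketched in C roughly as follows. The function names are made up for illustration and are not taken from the patch; the sketch also assumes the product does not overflow 64 bits.

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): number of 8-byte offsets
 * that fit in one table of table_size clusters. */
static uint64_t qed_table_entries(uint32_t cluster_size, uint32_t table_size)
{
    return ((uint64_t)table_size * cluster_size) / sizeof(uint64_t);
}

/* Hypothetical helper: largest image reachable through L1 -> L2 -> data.
 * Each L1 entry points at an L2 table, each L2 entry at one data cluster. */
static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
{
    uint64_t entries = qed_table_entries(cluster_size, table_size);

    return entries * entries * cluster_size;
}
```

Calling qed_max_image_size(65536, 4) reproduces the 2^46 figure worked out by hand above.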

Is the image_size the limit?


  How many clusters can there be?

table_entries * table_entries

What happens if the image_size is not a multiple of the cluster size?

The code checks this and fails at open() or create() time.

Wouldn't image_size be redundant if cluster_size and table_size determine the
image size?

In a two level table, if you make table_size the determining factor, the image has to be a multiple of the space spanned by the L2 tables which in the default case for qed is 2GB.

+ *
+ * All fields are little-endian on disk.
+ */
+typedef struct {
+    uint32_t magic;                 /* QED */
+    uint32_t cluster_size;          /* in bytes */
Doesn't cluster_size need to be a power of two?

Yes.  It's enforced at open() and create() time but needs to be in the spec.
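
As a rough illustration of the two header checks mentioned in this reply (cluster_size must be a power of two, and image_size must be a multiple of cluster_size), a minimal sketch might look like this. The function names are hypothetical and this is not the code from the patch, which may perform additional checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v has exactly one bit set. */
static bool is_power_of_two(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Hypothetical validation helper (name is an assumption): the kind of
 * check an open() or create() path could run before accepting a header. */
static bool qed_geometry_is_valid(uint32_t cluster_size, uint64_t image_size)
{
    if (!is_power_of_two(cluster_size)) {
        return false;
    }
    /* image_size must be a whole number of clusters */
    return image_size % cluster_size == 0;
}
```

A caller would reject the image when qed_geometry_is_valid() returns false, which matches the "fails at open() or create() time" behavior described earlier in this mail.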

+    uint32_t table_size;            /* table size, in clusters */
+    uint32_t first_cluster;         /* first usable cluster */
This introduces some limits to the location of first cluster, with 4k
clusters it must reside within the first 16TB. I guess it doesn't

first_cluster is a bad name. It should be header_size and yeah, there is a limit on header_size.

+    uint64_t features;              /* format feature bits */
+    uint64_t compat_features;       /* compatible feature bits */
+    uint64_t l1_table_offset;       /* L1 table offset, in bytes */
+    uint64_t image_size;            /* total image size, in bytes */
+    uint32_t backing_file_offset;   /* in bytes from start of header */
+    uint32_t backing_file_size;     /* in bytes */
+    uint32_t backing_fmt_offset;    /* in bytes from start of header */
+    uint32_t backing_fmt_size;      /* in bytes */
+} QEDHeader;
+typedef struct {
+    uint64_t offsets[0];            /* in bytes */
+} QEDTable;
Is this for both L1 and L2 tables?

Yes, which has the nice advantage of simplifying the code quite a bit.
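
To show why one table type is enough, here is a sketch of how a guest byte offset could be split into an L1 index, an L2 index, and an offset within the data cluster. Both indices are just positions in an array of 64-bit offsets. The struct and function names are assumptions for illustration, and the index math is the usual two-level scheme rather than code lifted from the patch.

```c
#include <stdint.h>

/* Hypothetical result type: where a guest byte offset lands. */
typedef struct {
    uint64_t l1_index;        /* which L2 table */
    uint64_t l2_index;        /* which data cluster within that L2 table */
    uint64_t cluster_offset;  /* byte offset inside the data cluster */
} QedLookup;

static QedLookup qed_lookup(uint64_t pos, uint32_t cluster_size,
                            uint32_t table_size)
{
    /* Entries per table: L1 and L2 tables share the same geometry. */
    uint64_t entries = ((uint64_t)table_size * cluster_size) / sizeof(uint64_t);
    /* Bytes of guest data covered by one L2 table. */
    uint64_t l2_span = entries * cluster_size;
    QedLookup r;

    r.l1_index = pos / l2_span;
    r.l2_index = (pos / cluster_size) % entries;
    r.cluster_offset = pos % cluster_size;
    return r;
}
```

With cluster_size = 65536 and table_size = 4, l2_span comes out to 2^31 bytes, which is the 2GB per-L2-table span mentioned above.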


Anthony Liguori
