Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format


From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Tue, 07 Sep 2010 15:41:55 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.11) Gecko/20100713 Lightning/1.0b1 Thunderbird/3.0.6

On 09/07/2010 02:25 PM, Blue Swirl wrote:
On Mon, Sep 6, 2010 at 10:04 AM, Stefan Hajnoczi <address@hidden> wrote:
QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity.  Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.

Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs.  Workloads
that do not cause new clusters to be allocated will perform similarly
to raw images due to in-memory metadata caching.

The format supports sparse disk images.  It does not rely on the host
filesystem's support for holes, making it a good choice for sparse disk
images that need to be transferred over channels where holes are not
supported.

Backing files are supported, so only the delta against a base image
needs to be stored.

The file format is extensible so that additional features can be added
later with graceful compatibility handling.

Internal snapshots are not supported.  This eliminates the need for
additional metadata to track copy-on-write clusters.
It would be nice to support external snapshots, so that a file other
than the disk image can store the snapshots. Snapshotting would then
be available even with raw or QED disk images. This is of course
not QED-specific.

There are two types of snapshots that I think can cause confusion: CPU/device state snapshots and block device snapshots.

qcow2 and qed both support block device snapshots. qed only supports external snapshots (via backing_file), whereas qcow2 supports both external and internal snapshots. Internal snapshots are the source of an incredible amount of complexity in the format.

qcow2 can also store CPU/device state snapshots and correlate them with block device snapshots (within a single block device). It only supports non-live CPU/device state snapshots.

OTOH, qemu can support live snapshotting via live migration. Today, it can be used to snapshot CPU/device state to a file on the filesystem with minimal downtime.

Combined with an external block snapshot and data correlating the two, this could be used to implement a single "snapshot" command that would behave like savevm but would not pause the guest's execution.

It's really just a matter of plumbing to expose an interface for this today. We have all of the infrastructure we need.

+ *
+ * +--------+----------+----------+----------+-----+
+ * | header | L1 table | cluster0 | cluster1 | ... |
+ * +--------+----------+----------+----------+-----+
+ *
+ * There is a 2-level pagetable for cluster allocation:
+ *
+ *                     +----------+
+ *                     | L1 table |
+ *                     +----------+
+ *                ,------'  |  '------.
+ *           +----------+   |    +----------+
+ *           | L2 table |  ...   | L2 table |
+ *           +----------+        +----------+
+ *       ,------'  |  '------.
+ *  +----------+   |    +----------+
+ *  |   Data   |  ...   |   Data   |
+ *  +----------+        +----------+
+ *
+ * The L1 table is fixed size and always present.  L2 tables are allocated on
+ * demand.  The L1 table size determines the maximum possible image size; it
+ * can be influenced using the cluster_size and table_size values.
The formula for calculating the maximum size would be nice.

table_entries = table_size * cluster_size / 8
max_size = table_entries * table_entries * cluster_size

It's a hell of a lot easier to do powers-of-two math, though.  With the
defaults (table_size = 4 clusters, cluster_size = 64KB):

table_entries = 2^2 * 2^16 / 2^3 = 2^15
max_size = 2^15 * 2^15 * 2^16 = 2^46 = 64TB
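
For the sake of having something executable, here is a quick C sketch of
that calculation (the function name is mine, not from the patch):

#include <stdint.h>
#include <stdio.h>

/* Sketch of the max-size math above; not code from the patch.
 * table_size is in clusters and each table entry is an 8-byte offset. */
static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
{
    uint64_t table_entries = (uint64_t)table_size * cluster_size / 8;
    return table_entries * table_entries * cluster_size;
}

int main(void)
{
    /* Defaults: 64KB clusters, 4-cluster tables -> 2^46 = 64TB. */
    printf("%llu bytes\n",
           (unsigned long long)qed_max_image_size(65536, 4));
    return 0;
}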

Is the image_size the limit?

No.

How many clusters can there be?

table_entries * table_entries

What happens if the image_size is not a multiple of the cluster size?

The code checks this and fails at open() or create() time.

Wouldn't image_size be redundant if cluster_size and table_size determine the image size?

In a two-level table, if you make table_size the determining factor, the image size has to be a multiple of the space spanned by one L2 table, which in the default case for qed is 2GB.
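
Spelled out in the same powers-of-two style as above:

l2_span = table_entries * cluster_size = 2^15 * 2^16 = 2^31 = 2GB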

+ *
+ * All fields are little-endian on disk.
+ */
+
+typedef struct {
+    uint32_t magic;                 /* QED */
+
+    uint32_t cluster_size;          /* in bytes */
Doesn't cluster_size need to be a power of two?

Yes.  It's enforced at open() and create() time but needs to be in the spec.
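
Something along these lines at open()/create() time (a sketch of the
checks described in this thread, not the actual patch code):

#include <stdbool.h>
#include <stdint.h>

/* Sketch: reject a header whose cluster_size is not a power of two
 * or whose image_size is not cluster-aligned. */
static bool qed_is_power_of_2(uint32_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

static bool qed_header_is_valid(uint32_t cluster_size, uint64_t image_size)
{
    if (!qed_is_power_of_2(cluster_size)) {
        return false;
    }
    /* image_size must be a whole number of clusters. */
    return image_size % cluster_size == 0;
}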

+    uint32_t table_size;            /* table size, in clusters */
+    uint32_t first_cluster;         /* first usable cluster */
This introduces some limits on the location of the first cluster: with 4k
clusters it must reside within the first 16TB. I guess it doesn't
matter.

first_cluster is a bad name. It should be header_size, and yeah, there is a limit on header_size.
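
For reference, the 16TB figure follows from first_cluster being a 32-bit
cluster index:

max_header_end = 2^32 * 2^12 = 2^44 = 16TB (with 4k clusters)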

+
+    uint64_t features;              /* format feature bits */
+    uint64_t compat_features;       /* compatible feature bits */
+    uint64_t l1_table_offset;       /* L1 table offset, in bytes */
+    uint64_t image_size;            /* total image size, in bytes */
+
+    uint32_t backing_file_offset;   /* in bytes from start of header */
+    uint32_t backing_file_size;     /* in bytes */
+    uint32_t backing_fmt_offset;    /* in bytes from start of header */
+    uint32_t backing_fmt_size;      /* in bytes */
+} QEDHeader;
+
+typedef struct {
+    uint64_t offsets[0];            /* in bytes */
+} QEDTable;
Is this for both L1 and L2 tables?

Yes, which has the nice advantage of simplifying the code quite a bit.
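
For illustration, a single table type means L1 and L2 lookups are the
same indexing operation; both functions below index into a QEDTable's
offsets[] array, whether it is the L1 table (yielding an L2 table
offset) or an L2 table (yielding a data cluster offset).  This is a
sketch with names of my own, not code from the patch:

#include <stdint.h>

/* Sketch only: map a virtual disk byte offset to L1/L2 indices.
 * table_entries = table_size * cluster_size / 8, as above. */
static uint64_t qed_l1_index(uint64_t pos, uint32_t cluster_size,
                             uint64_t table_entries)
{
    return (pos / cluster_size) / table_entries;
}

static uint64_t qed_l2_index(uint64_t pos, uint32_t cluster_size,
                             uint64_t table_entries)
{
    return (pos / cluster_size) % table_entries;
}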

Regards,

Anthony Liguori