Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Tue, 07 Sep 2010 11:25:23 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100713 Lightning/1.0b1 Thunderbird/3.0.6

On 09/07/2010 11:09 AM, Avi Kivity wrote:
 On 09/07/2010 06:40 PM, Anthony Liguori wrote:

Need a checksum for the header.

Is that not a bit overkill for what we're doing?  What's the benefit?

Make sure we're not looking at a header write interrupted by a crash.

Couldn't hurt I guess. I don't think it's actually needed for L1/L2 tables FWIW.
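To illustrate how a header checksum would catch a write interrupted by a crash, here is a minimal sketch. The header layout (magic, cluster size, table size, CRC32) and the function names are assumptions made for the example, not the QED on-disk format.

```python
import struct
import zlib

# Hypothetical header layout, for illustration only:
# magic (4 bytes), cluster_size (u32), table_size (u32), crc32 (u32)
HEADER_FMT = "<4sIII"

def pack_header(magic: bytes, cluster_size: int, table_size: int) -> bytes:
    """Serialize the header; the CRC covers the fields with the CRC slot zeroed."""
    body = struct.pack(HEADER_FMT, magic, cluster_size, table_size, 0)
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(HEADER_FMT, magic, cluster_size, table_size, crc)

def header_is_valid(raw: bytes) -> bool:
    """Return True if the stored CRC matches the one recomputed from the fields."""
    if len(raw) != struct.calcsize(HEADER_FMT):
        return False
    magic, cluster_size, table_size, stored = struct.unpack(HEADER_FMT, raw)
    body = struct.pack(HEADER_FMT, magic, cluster_size, table_size, 0)
    return stored == (zlib.crc32(body) & 0xFFFFFFFF)

good = pack_header(b"QED\0", 65536, 4)
# Simulate a torn write: one field was updated on disk, the CRC was not.
torn = good[:8] + b"\xff\xff\xff\xff" + good[12:]
```

Here `header_is_valid(good)` holds while `header_is_valid(torn)` does not, so a reader could refuse the torn header instead of trusting it.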

The L2 link *should* be made after the data is in place on storage. However, when no ordering is enforced the worst case scenario is an L2 link to an unwritten cluster.

Or it may cause corruption if the physical file size is not committed, and L2 now points at a free cluster.

An fsync() will make sure the physical file size is committed. The metadata does not carry any additional integrity guarantees over the actual disk data except that in order to avoid internal corruption, we have to order the L2 and L1 writes.
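The ordering described here (data and file size committed first, L2 link second) could be sketched as follows. This is only an illustration on a scratch file: the 4096-byte cluster size, the 8-byte little-endian L2 entries, and the helper name `allocate_and_link` are assumptions, and `os.pwrite`/`os.pread` require a POSIX system.

```python
import os
import struct
import tempfile

CLUSTER = 4096  # illustrative cluster size, not the format's default

def allocate_and_link(fd: int, l2_offset: int, index: int, data: bytes) -> int:
    """Append a data cluster, flush it, then publish it in the L2 table."""
    cluster_offset = os.fstat(fd).st_size            # new cluster goes at end of file
    os.pwrite(fd, data.ljust(CLUSTER, b"\0"), cluster_offset)
    os.fsync(fd)                                     # data and file size reach storage first
    os.pwrite(fd, struct.pack("<Q", cluster_offset), l2_offset + 8 * index)
    os.fsync(fd)                                     # only now is the link durable
    return cluster_offset

with tempfile.TemporaryFile() as f:
    fd = f.fileno()
    os.pwrite(fd, b"\0" * CLUSTER, 0)                # a zeroed L2 table in cluster 0
    off = allocate_and_link(fd, 0, 3, b"hello")
    entry = struct.unpack("<Q", os.pread(fd, 8, 8 * 3))[0]
    payload = os.pread(fd, 5, off)
```

If the process dies between the two `fsync` calls, the file has grown but no L2 entry refers to the new cluster, which is the harmless failure mode.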

I was referring to "when no ordering is enforced, the worst case scenario is an L2 link to an unwritten cluster". This isn't true - worst case you point to an unallocated cluster which can then be claimed by data or metadata.

Right, it's necessary to do an fsync to protect against this. To make this user friendly, we could have a dirty bit in the header which gets set on first metadata write and then cleared on clean shutdown.

Upon startup, if the dirty bit is set, we do an fsck.
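The dirty-bit life cycle could look roughly like this. The `Image` class is a toy in-memory stand-in, and its `fsck` is a placeholder; both are hypothetical and only show when the flag is set, cleared, and checked.

```python
class Image:
    """Toy image; `dirty` stands in for a flag stored in the header."""

    def __init__(self, dirty: bool = False):
        self.dirty = dirty
        self.checked = False

    def open(self) -> None:
        # A flag left set by the previous session means it did not shut down cleanly.
        if self.dirty:
            self.fsck()

    def fsck(self) -> None:
        # Placeholder: a real check would scan the tables for inconsistent links.
        self.checked = True
        self.dirty = False

    def write_metadata(self) -> None:
        # Set the flag (and, in a real format, flush the header) before the
        # first metadata update of the session.
        if not self.dirty:
            self.dirty = True

    def close(self) -> None:
        # Clean shutdown: metadata has been flushed, so clear the flag.
        self.dirty = False

crashed = Image(dirty=True)
crashed.open()           # triggers the check

clean = Image()
clean.open()             # no check needed
clean.write_metadata()   # flag is now set until close()
```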

We can remove this requirement by copying-on-write any metadata write, and keeping two copies of the header (with version numbers and checksums).
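A sketch of the two-header idea: each copy carries a version number and a checksum, and the reader takes the newest copy that verifies. The slot layout (u64 version, u32 payload, CRC32) and the function names are assumptions for illustration.

```python
import struct
import zlib

FMT = "<QI"  # hypothetical slot body: version (u64), payload value (u32)
SLOT_SIZE = struct.calcsize(FMT) + 4  # body plus trailing CRC32

def encode(version: int, value: int) -> bytes:
    body = struct.pack(FMT, version, value)
    return body + struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF)

def decode(raw: bytes):
    """Return (version, value), or None if the slot fails its checksum."""
    if len(raw) != SLOT_SIZE:
        return None
    body, (crc,) = raw[:-4], struct.unpack("<I", raw[-4:])
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        return None
    return struct.unpack(FMT, body)

def pick_header(slot_a: bytes, slot_b: bytes):
    """Choose the valid slot with the highest version number."""
    valid = [h for h in (decode(slot_a), decode(slot_b)) if h is not None]
    return max(valid, default=None)

old = encode(7, 100)
new = encode(8, 200)
# Crash midway through rewriting the second slot: one payload byte is wrong.
torn_new = new[:8] + b"\xff" + new[9:]
```

With both slots intact the reader gets version 8; with the second slot torn it falls back to version 7 rather than reading garbage.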

QED has a property today that all metadata or cluster locations have a single location on the disk format that is immutable. Defrag would relax this but defrag can be slow.

Having an immutable on-disk location is a powerful property which eliminates a lot of complexity with respect to reference counting and dealing with free lists.

However, it exposes the format to "writes may corrupt overwritten data".

No, you never rewrite an L2 entry once it's been set. If an L2 entry isn't set, the contents of the cluster are all zeros.

If you write data to allocate an L2 entry, until you do a flush(), the data can either be what was written or all zeros.
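The read side of this rule is simple enough to sketch. The example below treats an L2 entry of 0 as "not set" and ignores backing files; the table and cluster store are plain Python objects invented for the illustration.

```python
CLUSTER = 4096  # illustrative cluster size

def read_cluster(l2_table: list, clusters: dict, index: int) -> bytes:
    """Return the data for one L2 slot; an entry of 0 means 'never allocated'."""
    offset = l2_table[index]
    if offset == 0:
        return bytes(CLUSTER)        # unallocated clusters read back as zeros
    return clusters[offset]

l2 = [0, 8192, 0]                    # only slot 1 has been allocated
store = {8192: b"A" * CLUSTER}
```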

For the initial design I would avoid introducing something like this. One of the nice things about features is that we can introduce multi-level trees as a future feature if we really think it's the right thing to do.

But we should start at a simple design with high confidence and high performance, and then introduce features with the burden that we're absolutely sure that we don't regress integrity or performance.

For most things, yes. Metadata checksums should be designed in though (since we need to double the pointer size).

Variable height trees have the nice property that you don't need multi cluster allocation. It's nice to avoid large L2s for very large disks.

FWIW, L2s are 256K at the moment and with a two level table, it can support 5PB of data. If we changed the tables to 128K, we could support 1PB and with 64K tables we would support 256TB.

So we could definitely reduce the table sizes now to be a single cluster and it would probably cover us for the foreseeable future.
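The way capacity scales with table size can be checked with a little arithmetic. The sketch assumes 8-byte table entries and 64K clusters, neither of which is stated in this message, so the absolute numbers depend on those assumptions; the point is the factor of four per halving of the table size.

```python
def max_image_size(table_bytes: int, cluster_bytes: int, entry_bytes: int = 8) -> int:
    """Addressable bytes for a two-level table: L1 entries x L2 entries x cluster size."""
    entries = table_bytes // entry_bytes
    return entries * entries * cluster_bytes

KIB = 1024
# Table sizes of 256K, 128K and 64K with an assumed 64K cluster size.
sizes = {t: max_image_size(t * KIB, 64 * KIB) for t in (256, 128, 64)}
```

Both the L1 and the L2 table shrink when the table size is halved, so the number of addressable clusters drops by a factor of four each time.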


Anthony Liguori