Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date: Tue, 14 Sep 2010 07:54:12 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100826 Lightning/1.0b1 Thunderbird/3.0.7

On 09/14/2010 05:46 AM, Stefan Hajnoczi wrote:
On Fri, Sep 10, 2010 at 10:22 PM, Jamie Lokier <address@hidden> wrote:
Stefan Hajnoczi wrote:
Since there is no ordering imposed between the data write and metadata
update, the following scenarios may occur on crash:
1. Neither data write nor metadata update reach the disk.  This is
fine, qed metadata has not been corrupted.
2. Data reaches disk but metadata update does not.  We have leaked a
cluster but not corrupted metadata.  Leaked clusters can be detected
with qemu-img check.
3. Metadata update reaches disk but data does not.  The interesting
case!  The L2 table now points to a cluster which is beyond the last
cluster in the image file.  Remember that file size is rounded down by
cluster size, so partial data writes are discarded and this case
Better add:

4. File size is extended fully, but the data didn't all reach the disk.
This case is okay.

If a data cluster does not reach the disk but the file size is
increased there are two outcomes:
1. A leaked cluster if the L2 table update did not reach the disk.
2. A cluster with junk data, which is fine since the guest has no
promise that the data safely landed on disk without completing a flush.

A flush is performed after allocating new L2 tables and before linking
them into the L1 table.  Therefore clusters can be leaked but an
invalid L2 table can never be linked into the L1 table.

5. Metadata is partially updated.
6. (Nasty) Metadata partial write has clobbered neighbouring
   metadata which wasn't meant to be changed.  (This may happen up
   to a sector size on normal hard disks - data is hard to come by.
   This happens to a much larger file range on flash and RAIDs
   sometimes - I call it the "radius of destruction").

6 can also happen when doing the L1 update mentioned earlier, in
which case you might lose a much larger part of the guest image.
These two cases are problematic.

And not worth the hassle. It might matter if you've bought your C-Gate hard drives from a guy with a blanket on the street and you're sending your disk array on the space shuttle during a solar storm, but if you're building on top of file systems with reasonable storage, these are not reasonable failure scenarios to design for.

There's a place for trying to cover these types of scenarios to build reliable storage arrays on top of super cheap storage but that's not our mission. That's what the btrfs's of the world are for.


Anthony Liguori

   I've been thinking in terms of atomic sector
updates, not in a model where updates can be partial or even
destructive at the byte level.  Do you have references where I can
read more about the radius of destruction? ;)

Transactional I/O solves this problem.  Checksums alone can detect
the problem but not fix it.  Duplicate metadata together with
checksums could be a solution, but I haven't thought through the details.

Any other suggestions?

Time to peek at md and dm to see how they safeguard metadata.

