Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format


From:	Avi Kivity
Subject:	Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date:	Thu, 09 Sep 2010 09:45:27 +0300
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.8) Gecko/20100806 Fedora/3.1.2-1.fc13 Thunderbird/3.1.2

 On 09/08/2010 03:48 PM, Anthony Liguori wrote:

On 09/08/2010 03:23 AM, Avi Kivity wrote:
 On 09/08/2010 01:27 AM, Anthony Liguori wrote:
FWIW, L2s are 256K at the moment and with a two level table, it can support 5PB of data.
I clearly suck at basic math today. The image supports 64TB today. Dropping to 128K tables would reduce it to 16TB and 64k tables would be 4TB.
Maybe we should do three levels then. Some users are bound to complain about 64TB.
That's just the default size. The table size and cluster sizes are configurable. Without changing the cluster size, the image can support up to 1PB.
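
The capacity figures quoted above can be reproduced with a short calculation. This is only a sketch: it assumes each table entry is an 8-byte offset and that the default cluster size is 64K, neither of which is stated in this message, and the helper name is made up for illustration.

```python
KiB = 1 << 10
TiB = 1 << 40

ENTRY_SIZE = 8  # assumption: one 64-bit offset per table entry


def max_image_size(table_size, cluster_size, entry_size=ENTRY_SIZE):
    """Bytes addressable through a two-level table (L1 -> L2 -> cluster)."""
    entries = table_size // entry_size
    # One L1 table points at `entries` L2 tables, each mapping `entries` clusters.
    return entries * entries * cluster_size


for table_kib in (256, 128, 64, 1024):
    size = max_image_size(table_kib * KiB, 64 * KiB)
    print(f"{table_kib:>5}K tables -> {size // TiB} TiB")
```

Under these assumptions the loop prints 64, 16, 4 and 1024 TiB. The first three match the 64TB, 16TB and 4TB figures in the thread; the 1 MiB table size used for the 1PB case is inferred here, not taken from the message.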

Loading very large L2 tables on demand will result in very long latencies. Increasing cluster size will result in very long first write latencies. Adding an extra level results in an extra random write every 4TB.

Today, we only need to sync() when we first allocate an L2 entry (because their locations never change). From a performance perspective, it's the difference between an fsync() every 64k vs. every 2GB.
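
The "64k vs. 2GB" comparison can be sketched as follows, assuming 256K L2 tables, 64K clusters and 8-byte entries (the entry size is an assumption). The 64 GiB workload and the function names are hypothetical, chosen only to make the ratio concrete.

```python
KiB = 1 << 10
GiB = 1 << 30


def l2_coverage(table_size=256 * KiB, cluster_size=64 * KiB, entry_size=8):
    """Guest bytes mapped by one L2 table (entry_size is an assumption)."""
    return (table_size // entry_size) * cluster_size


def syncs_needed(bytes_allocated, bytes_per_sync):
    """Syncs required if one sync is issued per `bytes_per_sync` allocated."""
    return -(-bytes_allocated // bytes_per_sync)  # ceiling division


workload = 64 * GiB  # hypothetical amount of freshly allocated data
print("L2 table covers:", l2_coverage() // GiB, "GiB")
print("sync per cluster: ", syncs_needed(workload, 64 * KiB))
print("sync per L2 table:", syncs_needed(workload, l2_coverage()))
```

With these numbers one L2 table covers 2 GiB, so filling 64 GiB needs 32 syncs when syncing per L2 table, against 1048576 when syncing per 64K cluster.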
Yup. From a correctness perspective, it's the difference between a corrupted filesystem on almost every crash and a corrupted filesystem in some very rare cases.
I'm not sure I understand your corruption comment. Are you claiming that without checksumming, you'll often get corruption or are you claiming that without checksums, if you don't sync metadata updates you'll get corruption?

No, I'm claiming that with checksums but without allocate-on-write you will have frequent (detected) data loss after power failures. Checksums need to go hand-in-hand with allocate-on-write (which happens to be the principle underlying zfs and btrfs).
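
A toy model of that claim, not taken from qed, zfs or btrfs: the "disk" is a Python dict, a power failure is simulated by simply stopping between two writes, and all names are hypothetical. It only illustrates why a checksum stored apart from data that is overwritten in place can report loss after a crash, while writing to a fresh block leaves the old data and checksum consistent.

```python
import zlib


def crash_during_in_place_write(old, new):
    """Overwrite data in place, then crash before the checksum is updated."""
    disk = {"data": old, "csum": zlib.crc32(old)}
    disk["data"] = new  # the data write reaches the disk
    # -- simulated power failure: the checksum update never happens --
    return zlib.crc32(disk["data"]) == disk["csum"]


def crash_during_allocate_on_write(old, new):
    """Write new data to a fresh block, then crash before metadata is switched."""
    blocks = {0: old}
    meta = {"ptr": 0, "csum": zlib.crc32(old)}
    blocks[1] = new  # the old block is left untouched
    # -- simulated power failure: ptr and csum still describe block 0 --
    return zlib.crc32(blocks[meta["ptr"]]) == meta["csum"]


old, new = b"old contents", b"new contents"
print("in-place verifies after crash:         ", crash_during_in_place_write(old, new))
print("allocate-on-write verifies after crash:", crash_during_allocate_on_write(old, new))
```

In this model the in-place case fails verification (the loss is detected, but the old data is gone), while the allocate-on-write case still verifies against the old block.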

qed is very careful about ensuring that we don't need to do syncs and we don't get corruption because of data loss. I don't necessarily buy your checksumming argument.

The requirement for checksumming comes from a different place. For decades we've enjoyed very low undetected bit error rates. However, the actual amount of data is increasing to the point that it makes an undetectable bit error likely, just by throwing a huge amount of bits at storage. Write ordering doesn't address this issue.

Virtualization is one of the uses where you have a huge number of bits. btrfs addresses this, but if you have (working) btrfs you don't need qed. Another problem is nfs; TCP and UDP checksums are incredibly weak and it is easy for a failure to bypass them. Ethernet CRCs are better, but they only work if the error is introduced after the CRC is taken and before it is verified.
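
One concrete way a 16-bit ones' complement sum (the style of checksum used by TCP and UDP) can miss corruption is sketched below: swapping two 16-bit words leaves the sum unchanged, whereas a CRC-32 notices. The payload bytes are arbitrary, and `zlib.crc32` is used here only as a stand-in for a stronger check; this shows one class of undetected error, not a full analysis.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78\x9a\xbc\xde\xf0"
# Corrupt the payload by swapping its first two 16-bit words.
swapped = original[2:4] + original[0:2] + original[4:]

print("ones' complement sum equal:", internet_checksum(original) == internet_checksum(swapped))
print("CRC-32 equal:              ", zlib.crc32(original) == zlib.crc32(swapped))
```

Because addition is commutative, the ones' complement sum is identical for both payloads, while the CRC-32 values differ.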

Well, if we introduce a minimal format, we need to make sure it isn't too minimal.
I'm still not sold on the idea. What we're doing now is pushing the qcow2 complexity to users. We don't have to worry about refcounts now, but users have to worry whether the machine they're copying the image to supports qed or not.
The performance problems with qcow2 are solvable. If we preallocate clusters, the performance characteristics become essentially the same as qed.
By creating two code paths within qcow2.


You're creating two code paths for users.

It's not just the reference counts, it's the lack of guaranteed alignment, compression, and some of the other poor decisions in the format.
If you have two code paths in qcow2, you have non-deterministic performance because users that do reasonable things with their images will end up getting catastrophically bad performance.

We can address that in the tools. "By enabling compression, you may reduce performance for multithreaded workloads. Abort/Retry/Ignore?"

A new format doesn't introduce much additional complexity. We provide an image conversion tool and we can almost certainly provide an in-place conversion tool that makes the process very fast.

It requires users to make a decision. By the time qed is ready for mass deployment, 1-2 years will have passed. How many qcow2 images will be in the wild then? How much scheduled downtime will be needed? How much user confusion will be caused?

Virtualization is about compatibility. In-guest compatibility first, but keeping the external environment stable is also important. We really need to exhaust the possibilities with qcow2 before giving up on it.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
