Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

From:	Avi Kivity
Subject:	Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
Date:	Fri, 10 Sep 2010 16:47:00 +0300
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.8) Gecko/20100806 Fedora/3.1.2-1.fc13 Thunderbird/3.1.2

 On 09/10/2010 04:14 PM, Anthony Liguori wrote:

On 09/10/2010 06:14 AM, Avi Kivity wrote:
The point of an image format is not to recreate btrfs in software. It's to provide a mechanism to allow users to move images around reasonably, but once an image is present on a reasonable filesystem, we should more or less get the heck out of the way.
You can achieve exactly the same thing with qcow2. Yes, it's more work, but it's also less disruptive to users.
This is getting dangerously close to a vbus vs. virtio discussion :-)
Let me review the motivation for QED and why we've decided incremental improvements to qcow2 were not viable.
1) qcow2 has awful performance characteristics


The current qcow2 implementation, yes.  The qcow2 format, no.

2) qcow2 has historically had data integrity issues. It's unclear anyone is willing to say that they're 100% confident that there aren't still data integrity issues in the format.

Fast forward a few years, no one will be 100% confident there are no data integrity issues in qed.

3) The users I care most about are absolutely uncompromising about data integrity. There is no room for uncertainty or trade-offs when you're building an enterprise product.


100% in agreement here.

4) We have looked at trying to fix qcow2. It appears to be a monumental amount of work that starts with a rewrite where it's unclear if we can even keep supporting all of the special features. IOW, there is likely to be a need for users to experience some type of image conversion or optimization process.


I don't see why.

5) A correct version of qcow2 has terrible performance.


Not inherently.

You need to do a bunch of fancy tricks to recover that performance. Every fancy trick needs to be carefully evaluated with respect to correctness. There's a large surface area for potential data corruptors.

s/large/larger/. The only real difference is the refcount table, which I agree sucks, but happens to be nice for TRIM support.

We're still collecting performance data, but here's an example of what we're talking about.

FFSB Random Writes MB/s (Block Size=8KB)

              Native     Raw    QCow2     QED
1 Thread        30.2    24.4     22.7    23.4
8 Threads      145.1   119.9     10.6   112.9
16 Threads     177.1   139.0     10.1   120.9

The performance difference is an order of magnitude. qcow2 bounces all requests, needs to issue synchronous metadata updates, and only supports a single outstanding request at a time.
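
As a quick check on the "order of magnitude" claim, the ratios can be computed directly from the FFSB figures quoted above. This is only a sketch: the dictionary layout and the `ratios` helper are illustrative names, and the numbers are copied from the table as posted.

```python
# Throughput figures (MB/s) copied from the FFSB random-write table above.
results = {
    "1 thread":   {"native": 30.2,  "raw": 24.4,  "qcow2": 22.7, "qed": 23.4},
    "8 threads":  {"native": 145.1, "raw": 119.9, "qcow2": 10.6, "qed": 112.9},
    "16 threads": {"native": 177.1, "raw": 139.0, "qcow2": 10.1, "qed": 120.9},
}


def ratios(row):
    """Return QED throughput relative to qcow2 and relative to raw."""
    return {
        "qed_vs_qcow2": row["qed"] / row["qcow2"],
        "qed_vs_raw": row["qed"] / row["raw"],
    }


for name, row in results.items():
    r = ratios(row)
    print(f"{name}: QED/QCow2 = {r['qed_vs_qcow2']:.1f}x, "
          f"QED/Raw = {r['qed_vs_raw']:.0%}")
```

At 1 thread the two formats are roughly level; the gap of about 10x or more only appears at 8 and 16 threads, which fits the point about a single outstanding request.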

Those are properties of the implementation, not the format. The format makes it harder to get it right but doesn't give us a free pass not to do it.

With good performance and high confidence in integrity, it's a no brainer as far as I'm concerned. We have a format that is easy to rationalize as correct and performs damn close to raw. On the other hand, we have a format that no one is confident is correct, that is even harder to rationalize as correct, and that is an order of magnitude off raw in performance.
It's really a no brainer.

Sure, because you don't care about users. All of the complexity of changing image formats (and deciding whether to do that or not) is hidden away.

The impact to users is minimal. Upgrading images to a new format is not a big deal. This isn't guest visible and we're not talking about deleting qcow2 and removing support for it.

It's a big deal to them. Users are not experts in qemu image formats. They will have to learn how to do it, and whether they can do it (need to upgrade all your qemus before you can do it, need to make sure you're not using qcow2 features, need to be sure you're not planning to use qcow2 features).


Sure, we'll support qcow2, but will we give it the same attention?

Today, users have to choose between performance and reliability or features. QED offers an opportunity to be able to tell users to just always use QED as an image format and forget about raw/qcow2/everything else.
raw will always be needed for direct volume access and shared storage. qcow2 will always be needed for old images.
My point is that for the future, the majority of people no longer have to think about "do I need performance more than I need sparse images?".


That can be satisfied with qcow2 + preallocation.

If they have some special use case, fine, but for most people we simplify their choices.
You can say, let's just make qcow2 better, but we've been trying that for years and we have an existence proof that we can do it in a straightforward fashion with QED.
When you don't use the extra qcow2 features, it has the same performance characteristics as qed.
If you're willing to leak blocks on a scale that is still unknown.


Who cares, those aren't real storage blocks.

It's not at all clear that making qcow2 have the same characteristics as qed is an easy problem. qed is specifically designed to avoid synchronous metadata updates. qcow2 cannot achieve that.

qcow2 and qed are equivalent if you disregard the refcount table (which we address by preallocation). Exactly the same technique you use for sync-free metadata updates in qed can be used for qcow2.

You can *potentially* batch metadata updates by preallocating clusters, but what's the right amount to preallocate


You look at your write rate and adjust it dynamically so you never wait.

and is it really okay to leak blocks at that scale?

Again, those aren't real blocks. And we're talking power loss anyway. It's certainly better than requiring fsck for correctness.
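
One way to read "look at your write rate and adjust it dynamically" is to size each preallocation batch so it covers the allocations expected during one synchronous metadata update, with some headroom. The sketch below is an assumption-laden illustration, not anything from the qcow2 or qed code: the function names, the headroom factor, and the 64 KiB cluster size are all hypothetical parameters.

```python
import math


def preallocation_batch(alloc_rate, flush_latency, headroom=2.0, minimum=1):
    """Clusters to reserve per metadata flush so writers should not have to wait.

    alloc_rate:    observed cluster allocations per second (assumed measurable)
    flush_latency: seconds one synchronous metadata update takes
    headroom:      hypothetical safety factor on top of the bare estimate
    """
    needed = alloc_rate * flush_latency * headroom
    return max(minimum, math.ceil(needed))


def max_leak_bytes(batch, cluster_size=64 * 1024):
    """Upper bound on space stranded by a power loss: one unused batch."""
    return batch * cluster_size


batch = preallocation_batch(alloc_rate=2000, flush_latency=0.010)
print(batch, "clusters reserved per flush;",
      max_leak_bytes(batch), "bytes leaked at worst")
```

With a hypothetical 2,000 allocations per second and a 10 ms flush, this reserves about 40 clusters per flush, so a power loss would strand roughly 2.5 MiB of image file space at 64 KiB per cluster. The leak is bounded by the batch size, which is the trade-off being argued over here.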

It's a weak story either way. There's a burden of proof still required to establish that this would, indeed, address the performance concerns.

I don't see why you doubt it so much. Amortization is a well-known technique for reducing the cost of expensive operations.
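
The amortization argument can be illustrated with a toy allocator that pays one synchronous metadata update per batch of clusters instead of one per cluster. This is a sketch under stated assumptions: `BatchAllocator` and `flushes_for` are hypothetical names, and the flush is only counted, not performed.

```python
class BatchAllocator:
    """Toy cluster allocator that reserves clusters in batches.

    Each time the reservation runs out, one (simulated) synchronous
    metadata update covers the next `batch` clusters.
    """

    def __init__(self, batch):
        self.batch = batch
        self.reserved = 0    # clusters already covered by a flushed update
        self.next_free = 0   # index of the next cluster to hand out
        self.flushes = 0     # number of simulated synchronous updates

    def allocate(self):
        if self.reserved == 0:
            self.flushes += 1          # stand-in for one synchronous update
            self.reserved = self.batch
        self.reserved -= 1
        cluster = self.next_free
        self.next_free += 1
        return cluster


def flushes_for(n_allocations, batch):
    """Count simulated flushes needed for n_allocations at a given batch size."""
    alloc = BatchAllocator(batch)
    for _ in range(n_allocations):
        alloc.allocate()
    return alloc.flushes


print(flushes_for(1000, 1))   # one flush per allocation
print(flushes_for(1000, 64))  # one flush per 64 allocations
```

For 1,000 allocations, a batch of 1 needs 1,000 flushes while a batch of 64 needs 16, so the per-allocation cost of the expensive operation drops by the batch factor.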

You need to batch allocation and freeing, but that's fairly straightforward.
Yes, qcow2 has a long and tortured history and qed is perfect. Starting from scratch is always easier and more fun. Except for the users.
The fact that you're basing your argument on "think of the users" is strange because you're advocating not doing something that is going to be hugely beneficial for our users.

You misunderstand me. I'm not advocating dropping qed and stopping qcow2 development. I'm advocating dropping qed and working on qcow2 to provide the benefits that qed brings.

You're really arguing that we should continue only offering a format with weak data integrity and even weaker performance.


Those are not properties of the format, only of the implementation.

A new format doesn't introduce much additional complexity. We provide an image conversion tool and we can almost certainly provide an in-place conversion tool that makes the process very fast.
It introduces a lot of complexity for the users who aren't qed experts. They need to make a decision. What's the impact of the change? Are the features that we lose important to us? Do we know what they are? Is there any risk? Can we make the change online or do we have to schedule downtime? Do all our hosts support qed?
It's very simple. Use qed, convert all existing images. Image conversion is a part of virtualization. We have tools to do it. If they want to stick with qcow2 and are happy with it, fine, no one is advocating removing it.

This simple formula doesn't work if some of your hosts don't support qed yet. And it's still complicated for users because they have to understand all of that. "trust me, use qed" is not going to work.

Image conversion is a part of virtualization, yes. A sucky part; we should try to avoid it.

We can solve all possible problems and have images that users can move back to arbitrarily old versions of qemu with all of the same advantages of the newer versions. It's not realistic.


True, but we can do better than replace the image format.

Improving qcow2 will be very complicated for Kevin who already looks older beyond his years [1] but very simple for users.
I think we're all better off if we move past sunk costs and focus on solving other problems. I'd rather we all focus on improving performance and correctness even further than trying to make qcow2 be as good as what every other hypervisor had 5 years ago.
qcow2 has been a failure. Let's live up to it and move on. Making statements at each release that qcow2 has issues but we'll fix it soon just makes us look like we don't know what we're doing.


Switching file formats is a similar statement.

User confusion is reduced if we can make strong, clear statements: all users should use QED even if they care about performance. Today, there's mass confusion because of the poor state of qcow2.
If we improve qcow2 and make the same strong, clear statement we'll have the same results.
To be honest, the brand is tarnished. Once something gains a reputation for having poor integrity, it's very hard to overcome that.
Even if you have Kevin spend the next 6 months rewriting qcow2 from scratch, I'm going to have a hard time convincing customers to trust it.
All someone has to do is look at change logs to see that it has a bad history. That's more than enough to make people very nervous.

People will be nervous of something completely new (though I agree the simplicity is a very strong point of qed).

IMHO, we're long past exhausting the possibilities with qcow2. We still haven't decided what we're going to do for 0.13.0.
Sorry, I disagree 100%. How can you say that, when no one has yet tried, for example, batching allocations and frees? Or properly threaded it?
We've spent years trying to address problems in qcow2. And Stefan specifically has spent a good amount of time trying to fix qcow2. I know you've spent time trying to thread it too. I don't think you really grasp how difficult of a problem it is to fix qcow2. It's not just that the code is bad, the format makes something that should be simple more complicated than it needs to be.

IMO, the real problem is the state machine implementation. Threading it would make it much simpler. I wish I had the time to go back to do that.

What is specifically so bad about qcow2? The refcount table? It happens to be necessary for TRIM. Copy-on-write? It's needed for external snapshots.

qcow2 is not a properly designed image format. It was a weekend hacking session from Fabrice that he dropped in the code base and never really finished doing what he originally intended. The improvements that have been made to it are almost at the heroic level but we're only hurting our users by not moving on to something better.
I don't like qcow2 either. But from a performance perspective, it can be made equivalent to qed with some effort. It is worthwhile to expend that effort rather than push the burden to users.
The choices we have: 1) provide our users a format that has high performance and good data integrity; 2) continue to only offer a format that has poor performance and bad data integrity and promise that we'll eventually fix it.
We've been doing (2) for too long now. We need to offer a solution to users today. It's not fair to our users to not offer them a good solution just because we don't want to admit to previous mistakes.
If someone can fix qcow2 and make it competitive, by all means, please do.

We can have them side by side and choose later based on performance. Though I fear if qed is merged qcow2 will see no further work.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
