[Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori
Subject: [Qemu-devel] [RFC] Disk integrity in QEMU
Date: Thu, 09 Oct 2008 12:00:41 -0500
User-agent: Thunderbird (X11/20080925)


There's been a lot of discussion recently, mostly in other places, about disk integrity and performance in QEMU. I must admit, my own thinking in this space has changed pretty recently. I wanted to try to focus the conversation on qemu-devel so that we could get everyone involved and come up with a plan for the future.

Right now, QEMU can open a file in two ways. It can open it without any special caching flags (the default) or it can open it with O_DIRECT. O_DIRECT implies that the IO does not go through the host page cache. This is controlled with cache=on and cache=off respectively.
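To make the two modes concrete, here is a minimal sketch of how a cache setting could map onto open(2) flags. It is not QEMU's actual code: the function name open_flags() is hypothetical, and it assumes a Linux host, since Python only exposes os.O_DIRECT where the platform defines it.

```python
import os

def open_flags(cache: str) -> int:
    """Map a cache= setting to open(2) flags (illustrative only)."""
    flags = os.O_RDWR
    if cache == "off":
        # Bypass the host page cache entirely.
        flags |= os.O_DIRECT
    elif cache != "on":
        raise ValueError("cache must be 'on' or 'off'")
    # cache=on: no special flags, IO goes through the host page cache.
    return flags
```

The resulting value would be passed as the flags argument of os.open() (or open(2) in C).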

When cache=on, read requests may not actually go to the disk. If a previous read request (by some application on the system) has read the same data, then it becomes a simple memcpy(). Also, the host IO scheduler may do read-ahead, which means that the data may already be available from that. In general, the host knows the most about the underlying disk system and the total IO load on the system, so it is far better suited to optimize these sorts of things than the guest.

Write requests end up being simple memcpy()s too as the data is just copied into the page cache and the page is scheduled to be eventually written to disk. Since we don't know when the data is actually written to disk, we tell the guest the data is written before it actually is.
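As a rough illustration of the gap between "write() returned" and "data is on disk", the sketch below writes through the page cache and only optionally waits for the data to reach storage. The helper name write_cached() and its flush parameter are made up for this example.

```python
import os

def write_cached(path: str, data: bytes, flush: bool = False) -> int:
    """Write data through the host page cache; optionally wait for the disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # write() returns once the data has been copied into the page cache.
        written = os.write(fd, data)
        if flush:
            # fsync() blocks until the kernel has pushed the data to storage.
            os.fsync(fd)
        return written
    finally:
        os.close(fd)
```

With flush=False, a completion reported to the guest at this point says nothing about whether the data has reached the disk.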

If you assume that the host is stable, then there isn't an integrity issue. This assumes that you have backup power and that the host OS has no bugs. It's not a totally unreasonable assumption but for a large number of users, it's not a good assumption.

A side effect of cache=off is that data integrity only depends on the integrity of your storage system (which isn't always safe, btw), which is probably closer to what most users expect. There are many other side effects though.

An alternative to cache=off that addresses the data integrity problem directly is to open all disk images with O_DSYNC. This will still use the host page cache (and therefore get all the benefits of it) but will only signal write completion when the data is actually written to disk. The effect of this is to make the integrity of the VM equal to the integrity of the storage system (no longer relying on the host). By still going through the page cache, you still get the benefits of the host's IO scheduler and read-ahead. Only write performance is affected (reads are equivalent). If you run a write benchmark in a guest today, you'll see a number that is higher than native. The implication here is that data integrity is not being maintained if you don't trust the host. O_DSYNC takes care of this.
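For reference, a minimal sketch of what opening an image with O_DSYNC looks like from userspace. The helper names open_image_dsync() and write_block() are hypothetical and this is not the patch itself; it only shows the flag in use on a POSIX host.

```python
import os

def open_image_dsync(path: str) -> int:
    """Open (or create) an image so every write waits for the data to hit disk."""
    # O_DSYNC: each write completes only after the data is on stable storage.
    # Reads are still served from the host page cache.
    return os.open(path, os.O_RDWR | os.O_CREAT | os.O_DSYNC, 0o644)

def write_block(fd: int, offset: int, data: bytes) -> int:
    """Write one block at the given offset; returns the number of bytes written."""
    return os.pwrite(fd, data, offset)
```

Roughly speaking, each pwrite() here behaves like a plain write followed by fdatasync(), so the completion the guest sees corresponds to data on disk rather than data in host memory.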

Read performance should be unaffected by using O_DSYNC. O_DIRECT will significantly reduce read performance. I think we should use O_DSYNC by default and I have sent out a patch that contains that. We will follow up with benchmarks to demonstrate this.

There are certain benefits to using O_DIRECT. One argument for using O_DIRECT is that without it, you have to allocate memory in the host page cache to perform IO. If you are not sharing data between guests, and the guest has a relatively large amount of memory compared to the host, and you have a simple disk in the host, going through the host page cache wastes some memory that could be used to cache other IO operations on the system. I don't really think this is the typical case so I don't think this is an argument for having it on by default. However, it can be enabled if you know this is going to be the case.

The biggest benefit to using O_DIRECT is that you can potentially avoid ever bringing data into the CPU's cache. Once data is cached, copying it is relatively cheap. If you're never going to touch the data (think, disk DMA => nic DMA via sendfile()), then avoiding the CPU cache can be a big win. Again, I don't think this is the common case but the option is there in case it's suitable.
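To illustrate the "never touch the data" path, here is a small sketch that hands a file to a socket with sendfile(), so the bytes are never copied into the process. The function name send_file() is made up for this example, it assumes a Linux host, and it does not use O_DIRECT itself; it only shows the userspace side of the pattern.

```python
import os
import socket

def send_file(sock: socket.socket, path: str) -> int:
    """Send a whole file over a socket without reading it into userspace."""
    sent = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent < size:
            # The kernel moves the data from the file to the socket directly.
            n = os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
            if n == 0:
                break  # unexpected end of file
            sent += n
    return sent
```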

An important point is that today, we always copy data internally in QEMU which means practically speaking, you'll never see this benefit.

So to summarize, I think we should enable O_DSYNC by default to ensure that guest data integrity is not dependent on the host OS. I also think that, practically speaking, cache=off is only useful in very specialized circumstances. Part of the patch I'll follow up with includes changes to the man page to document all of this for users.



Anthony Liguori
