[Qemu-devel] [RFC] Disk integrity in QEMU
Thu, 09 Oct 2008 12:00:41 -0500
There's been a lot of discussion recently, mostly in other places, about
disk integrity and performance in QEMU. I must admit, my own thinking
has changed pretty recently in this space. I wanted to try and focus
the conversation on qemu-devel so that we could get everyone involved
and come up with a plan for the future.
Right now, QEMU can open a file in two ways. It can open it without any
special caching flags (the default) or it can open it O_DIRECT.
O_DIRECT implies that the IO does not go through the host page cache.
This is controlled with cache=on and cache=off respectively.
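
To make the two modes concrete, here is a rough sketch of how the cache= setting could map onto open(2) flags. This is not QEMU's actual code: image_open_flags() and image_open() are names made up for illustration, and O_DIRECT is Linux-specific (glibc only exposes it with _GNU_SOURCE).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for O_DIRECT with glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: translate the cache= setting into open(2) flags. */
int image_open_flags(bool cache_on)
{
    int flags = O_RDWR;
    if (!cache_on) {
        flags |= O_DIRECT;   /* cache=off: bypass the host page cache */
    }
    return flags;            /* cache=on: no special caching flags */
}

/* Hypothetical helper: open a disk image with the chosen caching mode. */
int image_open(const char *path, bool cache_on)
{
    assert(path != NULL);
    return open(path, image_open_flags(cache_on));
}
```

Calling image_open(path, true) gives the default behavior, and image_open(path, false) gives the O_DIRECT behavior.
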
When cache=on, read requests may not actually go to the disk. If a
previous read request (by some application on the system) has read the
same data, then it becomes a simple memcpy(). Also, the host IO
scheduler may do read ahead which means that the data may be available
from that. In general, the host knows the most about the underlying
disk system and the total IO load on the system so it is far better
suited to optimize these sorts of things than the guest.
Write requests end up being simple memcpy()s too as the data is just
copied into the page cache and the page is scheduled to be eventually
written to disk. Since we don't know when the data is actually written
to disk, we tell the guest the data is written before it actually is.
If you assume that the host is stable, then there isn't an integrity
issue. This assumes that you have backup power and that the host OS has
no bugs. It's not a totally unreasonable assumption but for a large
number of users, it's not a good assumption.
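
As a rough illustration of the gap described above (the function names are hypothetical, not QEMU code): a plain write() returns once the data has been copied into the page cache, and it takes an explicit fdatasync() before you can say the data has been handed to the storage system.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* On a file opened without special flags, this returns as soon as the
 * data has been copied into the host page cache. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;      /* interrupted, retry */
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical: only report completion after the data has been flushed
 * from the page cache to the storage system. */
int write_durable(int fd, const void *buf, size_t len)
{
    if (write_all(fd, buf, len) < 0) {
        return -1;
    }
    return fdatasync(fd);
}
```

With cache=on today, the guest is effectively told "done" at the point where write_all() returns, not where write_durable() returns.
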
A side effect of cache=off is that data integrity depends only on the
integrity of your storage system (which isn't always safe, btw). That is
probably closer to what most users expect. There many other side
An alternative to cache=off that addresses the data integrity problem
directly is to open all disk images with O_DSYNC. This will still use
the host page cache (and therefore get all the benefits of it) but will
only signal write completion when the data is actually written to disk.
The effect of this is to make the integrity of the VM equal the
integrity of the storage system (no longer relying on the host). By
still going through the page cache, you still get the benefits of the
host's IO scheduler and read-ahead. The only place where performance is
affected is writes (reads are equivalent). If you run a write
benchmark in a guest today, you'll see a number that is higher than
native. The implication here is that data integrity is not being
maintained if you don't trust the host. O_DSYNC takes care of this.
Read performance should be unaffected by using O_DSYNC. O_DIRECT will
significantly reduce read performance. I think we should use O_DSYNC by
default and I have sent out a patch that contains that. We will follow
up with benchmarks to demonstrate this.
There are certain benefits to using O_DIRECT. One argument for using
O_DIRECT is that, without it, you have to allocate memory in the host
page cache to perform IO. If you are not sharing data between guests, and the guest
has a relatively large amount of memory compared to the host, and you
have a simple disk in the host, going through the host page cache wastes
some memory that could be used to cache other IO operations on the
system. I don't really think this is the typical case so I don't think
this is an argument for having it on by default. However, it can be
enabled if you know this is going to be the case.
The biggest benefit to using O_DIRECT is that you can potentially avoid
ever bringing data into the CPU's cache. Once data is cached, copying it
is relatively cheap. If you're never going to touch the data (think,
disk DMA => nic DMA via sendfile()), then avoiding the CPU cache can be
a big win. Again, I don't think this is the common case but the option
is there in case it's suitable.
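
For reference, the sendfile() pattern mentioned above looks roughly like this on Linux. send_file_to() is a hypothetical helper, not anything in QEMU. In the NIC case out_fd would be a socket; the point is only that the data never passes through a user-space buffer. Whether it actually stays out of the CPU cache depends on the kernel and hardware.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: push the whole file at path to out_fd without
 * ever copying the data into a user-space buffer. */
int send_file_to(int out_fd, const char *path)
{
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0) {
        return -1;
    }

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;   /* sendfile() advances this for us */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0 && errno == EINTR) {
            continue;   /* interrupted, retry */
        }
        if (n <= 0) {
            break;      /* error or unexpected end of file */
        }
    }

    close(in_fd);
    return offset == st.st_size ? 0 : -1;
}
```
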
An important point is that today, we always copy data internally in QEMU
which means practically speaking, you'll never see this benefit.
So to summarize, I think we should enable O_DSYNC by default to ensure
that guest data integrity is not dependent on the host OS. I also think
that, practically speaking, cache=off is only useful for very specialized
circumstances. Part of the patch I'll follow up with includes changes
to the man page to document all of this for users.