[Qemu-devel] KVM "fake DAX" device flushing
Wed, 10 May 2017 21:26:00 +0530
We are sharing the initial project proposal for the
'KVM "fake DAX" device flushing' project for feedback.
We got the idea during a discussion with Rik van Riel.
We also request answers to the 'Questions' section.
The project idea is to use fake persistent memory with direct
access (DAX) in virtual machines. The overall goal of the project
is to increase the number of virtual machines that can be
run on a physical machine, in order to increase the density
of customer virtual machines.
The idea is to avoid the guest page cache and minimize the
memory footprint of virtual machines. By presenting a disk
image as an nvdimm direct access (DAX) memory region in a
virtual machine, the guest OS can avoid using page cache
memory for most file accesses.
Problem Statement :
* The guest uses its in-memory page cache to serve disk
read/write requests quickly. This results in a large memory
footprint for guests, without the host knowing many details
of the guest.
* If guests use direct access (DAX) with fake persistent
storage, the host manages the page cache for guests,
allowing the host to easily reclaim/evict less frequently
used page cache pages without requiring guest cooperation,
as ballooning would.
* The host manages the guest cache as an mmapped disk image area
in the qemu address space. This region is passed to the guest as
a fake persistent memory range. We need a new flushing interface
to flush this cache to secondary storage to persist guest data.
* A new asynchronous flushing interface will allow guests to
cause the host to flush the dirty data to the backing storage file.
Systems with pmem storage make use of the CLFLUSH instruction
to flush a single cache line to persistent storage, and that
takes care of flushing. With fake persistent storage in the
guest, we cannot depend on the CLFLUSH instruction to flush the
entire dirty cache to backing storage. Even if we trap and emulate
the CLFLUSH instruction, the guest vCPU has to wait until we flush
all the dirty memory. Instead, we need to implement a new
asynchronous guest flushing interface, which allows the guest
to specify a larger range to be flushed at once, and allows
the vCPU to run something else while the data is being synced.
* The new flushing interface will consist of a paravirt driver for
a new fake nvdimm-like device, which will process guest flushing
requests such as fsync/msync instead of pmem library calls
such as clflush. The corresponding device on the host side will be
responsible for handling flush requests for guest dirty pages.
The guest can put the current task to sleep, and the vCPU can run
any other task while host side flushing of guest pages is in progress.
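To make the asynchronous part concrete, here is a minimal userspace
sketch in Python. It is not the proposed device and not qemu code.
All names in it (FlushRequest, HostFlushDevice, submit, demo) are
hypothetical. A worker thread stands in for the host side device,
a threading.Event stands in for the completion event, and a
temporary file stands in for the disk image. It assumes Linux.

```python
import mmap
import queue
import tempfile
import threading

PAGE = mmap.PAGESIZE


class FlushRequest:
    """One guest request: flush `length` bytes starting at `offset`."""

    def __init__(self, offset, length):
        self.offset = offset
        self.length = length
        self.done = threading.Event()  # stands in for the completion event
        self.error = None


class HostFlushDevice:
    """Hypothetical host side of the paravirt flush device."""

    def __init__(self, backing_map):
        self.backing_map = backing_map
        self.requests = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, offset, length):
        """Queue a flush and return at once; the caller is not blocked."""
        req = FlushRequest(offset, length)
        self.requests.put(req)
        return req

    def _run(self):
        while True:
            req = self.requests.get()
            if req is None:  # shutdown marker
                break
            try:
                # Widen the range to page boundaries before syncing.
                start = (req.offset // PAGE) * PAGE
                end = -(-(req.offset + req.length) // PAGE) * PAGE
                end = min(end, len(self.backing_map))
                self.backing_map.flush(start, end - start)  # msync()
            except (OSError, ValueError) as exc:
                req.error = exc
            finally:
                req.done.set()  # wake whoever waits on this request

    def close(self):
        self.requests.put(None)
        self.worker.join()


def demo():
    with tempfile.NamedTemporaryFile() as image:
        image.truncate(4 * PAGE)
        with mmap.mmap(image.fileno(), 4 * PAGE) as backing:
            dev = HostFlushDevice(backing)
            backing[PAGE + 10:PAGE + 15] = b"dirty"  # a guest store
            req = dev.submit(PAGE + 10, 5)
            # The vCPU would be free to run another task here.
            req.done.wait()  # the task that called fsync sleeps here
            dev.close()
            ok = req.error is None
        image.seek(PAGE + 10)
        return ok, image.read(5)
```

The point of the sketch is only the shape of the interface: submit()
returns immediately with a handle, and only the task that needs the
data to be durable waits on that handle.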
Host controlled fake nvdimm DAX to avoid guest page cache :
* Bypass the guest page cache by using fake persistent storage
such as nvdimm with DAX. Guest reads/writes are done directly on
fake persistent storage without involving guest kernel for
* The fake nvdimm device passed to the guest is backed by a regular
file on the host, stored on secondary storage.
* Qemu already implements a fake NVDIMM/DAX device. Use this
capability to pass a regular host file (disk) as an nvdimm device.
* Nvdimm with DAX works with the ext4 and xfs filesystems. The
filesystem used must be DAX compatible.
* As we are using the guest disk as a fake DAX/NVDIMM device, we
need a mechanism for persisting data to the regular host
storage file that backs it.
* For the live migration use case, if the host side backing file is
on shared storage, we need to flush the page cache for the disk
image at the destination (new fadvise interface, FADV_INVALIDATE_CACHE?)
before starting execution of the guest on the destination host.
* In order to not have page cache inside the guest, qemu would:
1) mmap the guest's disk image and present that disk image to
the guest as a persistent memory range.
2) Present information to the guest telling it that the persistent
memory range is not physical persistent memory.
3) Present an additional paravirt device alongside the persistent
memory range, that can be used to sync (ranges of) data to disk.
* Guest would use the disk image mostly like a persistent memory
device, with two exceptions:
1) It would not tell userspace that the files on that device are
persistent memory. This is done so userspace knows to call
fsync/msync, instead of the pmem clflush library call.
2) When userspace calls fsync/msync on files on the fake persistent
memory device, issue a request through the paravirt device that
causes the host to flush the device back end.
* When the guest uses fake persistent storage, data updates can
still be in qemu memory. We need a way to flush cached data in the
host to the backing storage.
* Once the guest receives a completion event from the host, it will
allow userspace programs that were waiting on the fsync/msync to
continue.
* The host is responsible for paging in pages of the host backing
area for guest persistent memory as they are accessed by the guest,
and for evicting pages as host memory fills up.
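The host side steps above (mmap the image, sync a range on request,
get the image out of the host page cache) can be sketched with
existing Linux interfaces. This is a rough sketch, not qemu code.
The helper names map_image, sync_range, drop_host_cache and demo are
hypothetical. FADV_INVALIDATE_CACHE from the proposal does not exist
today, so the sketch substitutes POSIX_FADV_DONTNEED, which is only
a hint to the kernel and gives no guarantee that the pages are
actually dropped.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def map_image(path):
    """Map the whole disk image shared and writable."""
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        return mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mmap object keeps its own duplicate of fd


def sync_range(mapping, offset, length):
    """Write back the pages covering [offset, offset + length)."""
    start = (offset // PAGE) * PAGE
    end = min(-(-(offset + length) // PAGE) * PAGE, len(mapping))
    mapping.flush(start, end - start)  # msync() on the page range
    return start, end - start


def drop_host_cache(path):
    """Hint that cached pages of the image are no longer needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # dirty pages cannot be dropped, write them first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def demo():
    with tempfile.NamedTemporaryFile() as image:
        image.truncate(8 * PAGE)
        mapping = map_image(image.name)
        try:
            # A guest store lands directly in the shared mapping.
            mapping[3 * PAGE + 100:3 * PAGE + 105] = b"guest"
            flushed = sync_range(mapping, 3 * PAGE + 100, 5)
        finally:
            mapping.close()
        drop_host_cache(image.name)
        with open(image.name, "rb") as f:
            f.seek(3 * PAGE + 100)
            return flushed, f.read(5)
```

Note that sync_range() widens a 5 byte request to the whole page that
contains it, since writeback works on whole pages. A real interface
would have to decide whether the guest or the host does that rounding.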
Questions :
* What should the flushing interface between guest and host look
like?
* Any suggestions on how to hook the IO caching code into KVM/Qemu,
or thoughts on how we should do it?
* We are thinking of implementing a guest paravirt driver which will
send guest requests to Qemu to flush data to disk. We are not sure at
this point how to tell userspace to treat this device as any regular
device, without considering it a persistent device. Any suggestions?
* We have not yet thought about the impact on ballooning, but we feel
this solution could be better than ballooning in the long term, as we
will be managing all guest caches from the host side.
* Not sure this solution works for ARM and other architectures and