From: Dan Williams
Subject: Re: [Qemu-devel] [PATCH v4 0/6] nvdimm: support MAP_SYNC for memory-backend-file
Date: Wed, 31 Jan 2018 19:11:07 -0800
[ adding Michal and lsf-pci ]
On Wed, Jan 31, 2018 at 7:02 PM, Dan Williams <address@hidden> wrote:
> On Wed, Jan 31, 2018 at 6:29 PM, Haozhong Zhang
> <address@hidden> wrote:
>> + vfio maintainer Alex Williamson in case my understanding of vfio is
>> incorrect.
>>
>> On 01/31/18 16:32 -0800, Dan Williams wrote:
>>> On Wed, Jan 31, 2018 at 4:24 PM, Haozhong Zhang
>>> <address@hidden> wrote:
>>> > On 01/31/18 16:08 -0800, Dan Williams wrote:
>>> >> On Wed, Jan 31, 2018 at 4:02 PM, Haozhong Zhang
>>> >> <address@hidden> wrote:
>>> >> > On 01/31/18 14:25 -0800, Dan Williams wrote:
>>> >> >> On Tue, Jan 30, 2018 at 10:02 PM, Haozhong Zhang
>>> >> >> <address@hidden> wrote:
>>> >> >> > Linux 4.15 introduces a new mmap flag MAP_SYNC, which can be used to
>>> >> >> > guarantee the write persistence to mmap'ed files supporting DAX
>>> >> >> > (e.g.,
>>> >> >> > files on ext4/xfs file system mounted with '-o dax').
>>> >> >>
>>> >> >> Wait, MAP_SYNC does not guarantee persistence. It makes sure that the
>>> >> >> metadata is in sync after a fault. However, that does not make
>>> >> >> filesystem-DAX safe for use with QEMU, because we still need to
>>> >> >> coordinate DMA with filesystem operations. There is no way to do that
>>> >> >> coordination from within a guest. QEMU needs to use device-dax if the
>>> >> >> guest might ever perform DMA to a virtual-pmem range. See this patch
>>> >> >> set for more details on the DAX vs DMA problem [1]. I think we need to
>>> >> >> enforce this in the host kernel. I.e. do not allow file backed DAX
>>> >> >> pages to be mapped in EPT entries unless / until we have a solution to
>>> >> >> the DMA synchronization problem. Apologies for not noticing this
>>> >> >> earlier.
>>> >> >
>>> >> > QEMU does not truncate or punch holes in the file once it has been
>>> >> > mmap()'ed. Does the problem [1] still exist in that case?
>>> >>
>>> >> Something else on the system might. The only agent that could enforce
>>> >> protection is the kernel, and the kernel will likely just disallow
>>> >> passing addresses from filesystem-dax vmas through to a guest
>>> >> altogether. I think there's even a problem in the non-DAX case unless
>>> >> KVM is pinning pages while they are handed out to a guest. The problem
>>> >> is that we don't have a page cache page to pin in the DAX case.
>>> >>
>>> >
>>> > Does it mean any user-space code like
>>> > ptr = mmap(..., fd, ...); // fd refers to a file on DAX filesystem
>>> > // make DMA to ptr
>>> > is unsafe?
>>>
>>> Yes, it is currently unsafe because there is no coordination with the
>>> filesystem if it decides to make block layout changes. We can fix that
>>> in the non-virtualization case by having the filesystem wait for DMA
>>> completion callbacks (i.e. wait for all pages to be idle), but as far
>>> as I can see we can't do the same coordination for DMA initiated by a
>>> guest device driver.
>>>
>>
>> I think that fix [1] also works for KVM/QEMU. Guest DMA is
>> performed on two types of devices:
>>
>> 1. For emulated devices, the guest DMA requests are trapped and
>> actually performed by QEMU on the host side. The host side fix [1]
>> can cover this case.
>>
>> 2. For passthrough devices, vfio pins all pages used by the guest,
>> including those backed by dax mode files, if any device is
>> passed through to it. If I read the commit message in [2] correctly,
>> operations that change the page-to-file offset association of pages
>> from dax mode files will be deferred until the reference count of
>> the affected pages becomes 1. That is, if any passthrough device
>> is used with a VM, the changes of page-to-file offset will not be
>> able to happen until the VM is shut down, so the fix [1] still takes
>> effect here.
>
> This sounds like a longterm mapping under control of vfio and not the
> filesystem. See get_user_pages_longterm(): it is a problem if pages,
> especially DAX pages, are pinned indefinitely. It sounds like vfio is in the
> same boat as RDMA and cannot support long lived pins of DAX pages. As
> of 4.15 RDMA to filesystem-DAX pages has been disabled. The eventual
> fix will be to create a "memory-registration with lease" semantic
> available for RDMA so that the kernel can forcibly revoke page pinning
> to perform physical layout changes. In the near term it seems
> vaddr_get_pfn() needs to be fixed to use get_user_pages_longterm() so
> that filesystem-dax mappings are explicitly disallowed.
>
>> Another question is how a user-space application (e.g., QEMU) knows
>> whether it's safe to mmap a file on the DAX file system?
>
> I think we fix vaddr_get_pfn() to start failing for DAX mappings
> unless/until we can add a "with lease" mechanism. Userspace will know
> when it is safe again when vfio stops failing.
Btw, there is an LSF/MM topic proposal on this subject [1].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-January/013935.html
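
For readers following the MAP_SYNC part of the thread, here is a minimal userspace sketch of how a mapping with that flag might be requested, with a fallback to a plain shared mapping. The helper names map_file_try_sync() and demo_map_path() are hypothetical and do not come from the patch series; the fallback macro values are the ones from the x86 Linux UAPI headers and are an assumption on other architectures. Note that, as discussed above, a successful MAP_SYNC mapping only says the filesystem accepted the flag; it does not make filesystem-DAX pages safe targets for guest or long-lived DMA.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers (assumed x86 Linux UAPI values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: try MAP_SYNC first, fall back to MAP_SHARED.
 * *got_sync is set to 1 if the kernel accepted MAP_SYNC, else 0.
 * Returns MAP_FAILED on any other error.
 */
void *map_file_try_sync(int fd, size_t len, int *got_sync)
{
    assert(len > 0);

    /* MAP_SHARED_VALIDATE makes the kernel reject unknown flags
     * instead of silently ignoring them. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p != MAP_FAILED) {
        *got_sync = 1;
        return p;
    }

    /* EOPNOTSUPP: file does not support MAP_SYNC (e.g. not DAX).
     * EINVAL: kernel predates MAP_SHARED_VALIDATE (before 4.15). */
    if (errno != EOPNOTSUPP && errno != EINVAL)
        return MAP_FAILED;

    *got_sync = 0;
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/*
 * Hypothetical demo wrapper: create or open 'path', size it to 'len'
 * bytes, and map it with map_file_try_sync().
 */
void *demo_map_path(const char *path, size_t len, int *got_sync)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return MAP_FAILED;

    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return MAP_FAILED;
    }

    void *p = map_file_try_sync(fd, len, got_sync);
    close(fd); /* the mapping stays valid after the fd is closed */
    return p;
}
```

On a file that is not on a DAX-capable mount, the first mmap() is expected to fail and the helper falls back to MAP_SHARED with *got_sync set to 0, which a caller could use to decide whether to warn or refuse to continue.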