qemu-devel

Re: DMA region abruptly removed from PCI device


From: Felipe Franciosi
Subject: Re: DMA region abruptly removed from PCI device
Date: Tue, 7 Jul 2020 10:38:01 +0000


> On Jul 6, 2020, at 3:20 PM, Alex Williamson <alex.williamson@redhat.com> 
> wrote:
> 
> On Mon, 6 Jul 2020 10:55:00 +0000
> Thanos Makatos <thanos.makatos@nutanix.com> wrote:
> 
>> We have an issue when using the VFIO-over-socket libmuser PoC
>> (https://www.mail-archive.com/qemu-devel@nongnu.org/msg692251.html)
>> instead of the VFIO kernel module: we notice that DMA regions used by
>> the emulated device can be abruptly removed while the device is still
>> using them.
>> 
>> The PCI device we've implemented is an NVMe controller using SPDK, so
>> it polls the submission queues for new requests. We use the latest
>> SeaBIOS, which tries to boot from the NVMe controller. Several DMA
>> regions are registered (VFIO_IOMMU_MAP_DMA) and then the admin and a
>> submission queue are created. From this point SPDK polls both queues.
>> Then, the DMA region where the submission queue lies is removed
>> (VFIO_IOMMU_UNMAP_DMA) and then re-added at the same IOVA but at a
>> different offset. SPDK crashes soon after as it accesses invalid
>> memory. There is no other event (e.g. some PCI config space or NVMe
>> register write) happening between unmapping and re-mapping the DMA
>> region. My guess is that this behavior is legitimate and that this is
>> solved in the VFIO kernel module by releasing the DMA region only
>> after all references to it have been released, which is handled by
>> vfio_pin/unpin_pages, correct? If this is the case then I suppose we
>> need to implement the same logic in libmuser, but I just want to make
>> sure I'm not missing anything, as this is a substantial change.
> 
> The vfio_{pin,unpin}_pages() interface only comes into play for mdev
> devices and even then it's an announcement that a given mapping is
> going away and the vendor driver is required to release references.
> For normal PCI device assignment, vfio-pci is (aside from a few quirks)
> device agnostic and the IOMMU container mappings are independent of the
> device.  We do not have any device specific knowledge to know if DMA
> pages still have device references.  The user's unmap request is
> absolute, it cannot fail (aside from invalid usage) and upon return
> there must be no residual mappings or references of the pages.
> 
> If you say there's no config space write, ex. clearing bus master from
> the command register, then something like turning on a vIOMMU might
> cause a change in the entire address space accessible by the device.
> This would cause the identity map of IOVA to GPA to be replaced by a
> new one, perhaps another identity map if iommu=pt or a more restricted
> mapping if the vIOMMU is used for isolation.
> 
> It sounds like you have an incomplete device model, physical devices
> have their address space adjusted by an IOMMU independent of, but
> hopefully in collaboration with a device driver.  If a physical device
> manages to bridge this transition, do what it does.  Thanks,

Hi,

That's what we are trying to work out. IIUC, the problem we are having
is that a mapping removal was requested but the device was still
operational. We can surely figure out how to handle that gracefully,
but I'm trying to get my head around how real hardware handles that.
Maybe you can add some colour here. :)
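
For context, the sort of thing we're considering on the libmuser side is
a simple per-region reference count, so that an unmap is only completed
once the device model has dropped every reference to the region (i.e.
the same idea as vfio_pin/unpin_pages). Very roughly, and purely as a
sketch (none of these names exist in libmuser today, and locking is
omitted):

/* Sketch only: hypothetical per-region refcounting for libmuser. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dma_region {
    uint64_t iova;
    size_t   len;
    void    *vaddr;
    int      refcnt;        /* held while the device model uses the region */
    bool     unmap_pending; /* client asked for this region to go away */
};

/* The device model takes a reference before touching guest memory. */
static void *dma_region_get(struct dma_region *r)
{
    r->refcnt++;
    return r->vaddr;
}

/* Drop the reference; finish a deferred unmap once nobody uses it. */
static void dma_region_put(struct dma_region *r)
{
    if (--r->refcnt == 0 && r->unmap_pending) {
        /* actually tear the mapping down and ack the unmap here */
    }
}

/* Unmap request from the client: defer completion until refcnt is 0. */
static void dma_region_unmap(struct dma_region *r)
{
    r->unmap_pending = true;
    if (r->refcnt == 0) {
        /* no users left, tear it down immediately */
    }
}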

What happens when a device tries to write to a physical address that
has no memory behind it? Is it an MCE of sorts?

I haven't really ever looked at memory hot unplug in detail, but
after reading some QEMU code this is my understanding:

1) QEMU makes an ACPI request to the guest OS for mem unplug
2) Guest OS acks that memory can be pulled out
3) QEMU pulls the memory from the guest

Before step 3, I'm guessing that QEMU tells all device backends that
this memory is going away. I suppose that in normal operation, the
guest OS will have already stopped using the memory (i.e. before step
2), so there shouldn't be much to it. But I also suppose a malicious
guest could go "ah, you want to remove this dimm? sure, let me just
ask all these devices to start using it first... ok, there you go."

Is this understanding correct?
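
(For reference, my mental model of how an in-QEMU backend learns about
this is a MemoryListener: the vfio code in hw/vfio/common.c registers
one, and its region_del callback is the point after which the backend
must guarantee the device can no longer DMA into that range. From
memory and heavily trimmed, so treat it as a sketch rather than the
actual code:)

/* Simplified sketch of a MemoryListener-based backend. */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "exec/address-spaces.h"

static void my_backend_region_add(MemoryListener *listener,
                                  MemoryRegionSection *section)
{
    /* New guest RAM (or a vIOMMU change): set up DMA mappings here. */
}

static void my_backend_region_del(MemoryListener *listener,
                                  MemoryRegionSection *section)
{
    /*
     * This range is going away (hot unplug, vIOMMU change, ...).
     * By the time this returns, the device must not touch it again.
     */
}

static MemoryListener my_backend_listener = {
    .region_add = my_backend_region_add,
    .region_del = my_backend_region_del,
};

static void my_backend_init(void)
{
    memory_listener_register(&my_backend_listener, &address_space_memory);
}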

I don't think that's the case we're running into, but I think we need
to consider it as well. What's probably happening here is that the
guest went from SeaBIOS to the kernel, a PCI reset happened, and we
didn't plumb that message through correctly. While we are at it, we
should review the memory hot unplug business.
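
(On the reset/bus-master angle Alex mentioned: I would expect the device
model to stop touching guest memory as soon as bus mastering is cleared
or a reset reaches it, e.g. with a check like the one below in the poll
loop. Purely illustrative; the config space accessor is made up:)

/* Illustrative only: gate DMA on the emulated PCI command register. */
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND        0x04    /* command register offset in config space */
#define PCI_COMMAND_MASTER 0x0004  /* bus master enable bit */

/* Hypothetical accessor into our emulated config space. */
extern uint16_t nvme_cfg_read16(uint16_t offset);

static bool dma_allowed(void)
{
    return nvme_cfg_read16(PCI_COMMAND) & PCI_COMMAND_MASTER;
}

/* Called by the SPDK poller before dereferencing any queue memory. */
static bool poll_submission_queue(void)
{
    if (!dma_allowed()) {
        return false;   /* guest cleared BME, e.g. around a reset */
    }
    /* ... safe to read the SQ and process new commands here ... */
    return true;
}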

Thanks,
Felipe

> 
> Alex
> 



