qemu-devel

Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend


From: Yi Liu
Subject: Re: [RFC 15/18] vfio/iommufd: Implement iommufd backend
Date: Tue, 26 Apr 2022 22:08:30 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0 Thunderbird/91.7.0


On 2022/4/26 21:41, Jason Gunthorpe wrote:
On Tue, Apr 26, 2022 at 10:41:01AM +0000, Tian, Kevin wrote:

That's one case of incompatibility, but the IOMMU attach group callback
can fail in a variety of ways.  One that we've seen that is not
uncommon is that we might have an mdev container with various  mappings
to other devices.  None of those mappings are validated until the mdev
driver tries to pin something, where it's generally unlikely that
they'd pin those particular mappings.  Then QEMU hot-adds a regular
IOMMU backed device, we allocate a domain for the device and replay the
mappings from the container, but now they get validated and potentially
fail.  The kernel returns a failure for the SET_IOMMU ioctl, QEMU
creates a new container and fills it from the same AddressSpace, where
now QEMU can determine which mappings can be safely skipped.
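
As a rough illustration of that fallback, here is a minimal sketch loosely
modeled on QEMU's vfio_connect_container(); the helper name and the flat
list of container fds are illustrative, not the actual QEMU code:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Try to place a group into one of the containers already serving this
 * AddressSpace; if every attach fails, create a fresh container. */
static int group_find_or_create_container(int group_fd,
                                          int *container_fds, int n)
{
    for (int i = 0; i < n; i++) {
        if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fds[i]) == 0) {
            return container_fds[i];  /* existing container accepted the group */
        }
        /* The attach replays the container's existing mappings for the new
         * domain; mappings that were never validated (e.g. in an mdev-only
         * container) may now fail for a physically backed device. */
    }

    /* Fall back to a new container, refilled from the same AddressSpace,
     * where the known-unsafe mappings can be skipped. */
    int fd = open("/dev/vfio/vfio", O_RDWR);
    if (fd < 0) {
        return -errno;
    }
    if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &fd) ||
        ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU)) {
        close(fd);
        return -errno;
    }
    return fd;
}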

I think it is strange that the allowed DMA a guest can do depends on
the order in which devices are plugged into the guest, and varies from
device to device?

IMHO it would be nicer if qemu would be able to read the new reserved
regions and unmap the conflicts before hot plugging the new device. We
don't have a kernel API to do this, maybe we should have one?

For userspace drivers, it is fine to do it. For QEMU, it's not quite so easy since the IOVA is the GPA, which is determined by the e820 table.
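
A sketch of what the userspace side of that could look like today, reading
the existing per-group sysfs reserved_regions file and checking it against
the GPA ranges QEMU has mapped; check_reserved_conflicts() and the gpa_range
list are illustrative, and this is not the new kernel API being asked for:

#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

struct gpa_range { uint64_t start, end; };     /* end is inclusive */

static bool overlaps(const struct gpa_range *r, uint64_t start, uint64_t end)
{
    return start <= r->end && r->start <= end;
}

/* Parse /sys/kernel/iommu_groups/<N>/reserved_regions ("0xSTART 0xEND TYPE"
 * per line) and count entries that collide with already-mapped GPA ranges
 * (IOVA == GPA for QEMU). */
static int check_reserved_conflicts(int group_id,
                                    const struct gpa_range *ram, int n_ram)
{
    char path[64];
    snprintf(path, sizeof(path),
             "/sys/kernel/iommu_groups/%d/reserved_regions", group_id);

    FILE *f = fopen(path, "r");
    if (!f) {
        return -1;
    }

    uint64_t start, end;
    char type[32];
    int conflicts = 0;
    while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s", &start, &end, type) == 3) {
        for (int i = 0; i < n_ram; i++) {
            if (overlaps(&ram[i], start, end)) {
                conflicts++;   /* QEMU would have to unmap or refuse hot-plug */
            }
        }
    }
    fclose(f);
    return conflicts;
}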

A:
QEMU sets up a MemoryListener for the device AddressSpace and attempts
to map anything that triggers that listener, which includes not only VM
RAM, which is our primary mapping goal, but also miscellaneous devices,
unaligned regions, and other device regions, e.g. BARs.  Some of these
we filter out in QEMU with the broad generalization that unaligned
ranges aren't anything we can deal with, but the other device regions
cover anything that's mmap'd in QEMU, i.e. anything that has an
associated KVM memory slot.  IIRC, in the case I'm thinking of, the
mapping that triggered the replay failure was the BAR for an mdev
device.  No attempt was made to use gup or PFNMAP to resolve the
mapping when only the mdev device was present and the mdev host driver
didn't attempt to pin pages within its own BAR, but neither of these
methods worked for the replay (I don't recall further specifics).
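
For reference, the filtering being described has roughly this shape when
written as a MemoryListener callback; this is a simplified sketch against
QEMU's internal memory API, not a copy of hw/vfio/common.c:

#include "qemu/osdep.h"
#include "exec/memory.h"

static bool section_is_mappable(MemoryRegionSection *section)
{
    hwaddr iova = section->offset_within_address_space;
    hwaddr size = int128_get64(section->size);

    /* Only RAM-backed regions are candidates for the IOMMU backend. */
    if (!memory_region_is_ram(section->mr)) {
        return false;
    }

    /* The "broad generalization" above: ranges that are not host-page
     * aligned are not something we can deal with. */
    if ((iova | size) & (qemu_real_host_page_size() - 1)) {
        return false;
    }

    return true;
}

static void sketch_listener_region_add(MemoryListener *listener,
                                       MemoryRegionSection *section)
{
    if (!section_is_mappable(section)) {
        return;
    }
    /* ...resolve the host virtual address and issue the map here... */
}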

This feels sort of like a bug in iommufd, or perhaps qemu...

With iommufd only normal GUP'able memory should be passed to
map. Special memory will have to go through some other API. This is
different from vfio containers.

We could possibly check the VMAs in iommufd during map to enforce
normal memory... However, I'm also a bit surprised that qemu can't
identify the underlying memory source and avoid this?

e.g. currently I see log messages that it is passing P2P BAR memory
into iommufd map; this should be prevented inside qemu because it is
not reliable right now whether iommufd will correctly reject it.

yeah. qemu can filter the P2P BAR mapping and just stop it in qemu. We
haven't added that yet since it is something you will add in the
future, so we didn't include it in this RFC. :-) Please let me know if
it would be better to filter it from today.
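
For what it's worth, QEMU already marks mmap'd device BARs as "ram_device"
regions, so the filter could be as small as the sketch below (the helper
name is illustrative):

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Only plain, GUP-able guest RAM should reach iommufd map; device memory
 * exposed as RAM (e.g. a P2P BAR mmap) is skipped. */
static bool region_ok_for_iommufd_map(MemoryRegion *mr)
{
    if (!memory_region_is_ram(mr)) {
        return false;              /* MMIO etc., never mapped directly */
    }
    if (memory_region_is_ram_device(mr)) {
        return false;              /* device BAR: not normal GUP-able memory */
    }
    return true;
}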

IMHO multi-container should be avoided because it does force creating
multiple iommu_domains which does have a memory/performance cost.

yes. for multi-hw_pgtable, there is no choice as it is mostly due to
compatibility. But multi-container seems to be solvable if the kernel
and qemu have some extra support like you mentioned. But I'd like to
echo the point below, as there may be other possible reasons for the
attach to fail.

>> That's one case of incompatibility, but the IOMMU attach group callback
>> can fail in a variety of ways.
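
As one possible shape for that, here is a sketch of attach-with-fallback
written against the iommufd/vfio-cdev uAPI as it was eventually merged
upstream; the ioctl and struct names differ from this RFC, the device is
assumed to be already bound to the iommufd, and error handling is reduced
to the minimum:

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>
#include <linux/vfio.h>

/* Prefer sharing an existing IOAS (one iommu_domain, one copy of the
 * mappings); if the attach fails for a compatibility reason, fall back
 * to a private IOAS for this device. */
static int attach_device(int iommufd, int device_fd,
                         uint32_t shared_ioas_id, uint32_t *out_ioas_id)
{
    struct vfio_device_attach_iommufd_pt attach = {
        .argsz = sizeof(attach),
        .pt_id = shared_ioas_id,
    };

    if (ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach) == 0) {
        *out_ioas_id = shared_ioas_id;
        return 0;
    }

    struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
    if (ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc)) {
        return -errno;
    }

    attach.pt_id = alloc.out_ioas_id;
    if (ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach)) {
        return -errno;
    }
    *out_ioas_id = alloc.out_ioas_id;
    return 0;
}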

Though it is not so important that it is urgent (and copy makes it
work better anyhow), qemu can stay as it is.

yes. as a start, keeping it as it is would be simpler.

Jason

--
Regards,
Yi Liu


