Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
From: Jag Raman
Subject: Re: [PATCH v5 03/18] pci: isolated address space for PCI bus
Date: Thu, 10 Feb 2022 22:23:01 +0000

> On Feb 10, 2022, at 3:02 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Feb 10, 2022 at 12:08:27AM +0000, Jag Raman wrote:
>>
>>
>>> On Feb 2, 2022, at 12:34 AM, Alex Williamson <alex.williamson@redhat.com>
>>> wrote:
>>>
>>> On Wed, 2 Feb 2022 01:13:22 +0000
>>> Jag Raman <jag.raman@oracle.com> wrote:
>>>
>>>>> On Feb 1, 2022, at 5:47 PM, Alex Williamson <alex.williamson@redhat.com>
>>>>> wrote:
>>>>>
>>>>> On Tue, 1 Feb 2022 21:24:08 +0000
>>>>> Jag Raman <jag.raman@oracle.com> wrote:
>>>>>
>>>>>>> On Feb 1, 2022, at 10:24 AM, Alex Williamson
>>>>>>> <alex.williamson@redhat.com> wrote:
>>>>>>>
>>>>>>> On Tue, 1 Feb 2022 09:30:35 +0000
>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>
>>>>>>>> On Mon, Jan 31, 2022 at 09:16:23AM -0700, Alex Williamson wrote:
>>>>>>>>> On Fri, 28 Jan 2022 09:18:08 +0000
>>>>>>>>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, Jan 27, 2022 at 02:22:53PM -0700, Alex Williamson wrote:
>>>>>>>>>>
>>>>>>>>>>> If the goal here is to restrict DMA between devices, ie. peer-to-peer
>>>>>>>>>>> (p2p), why are we trying to re-invent what an IOMMU already does?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The issue Dave raised is that vfio-user servers run in separate
>>>>>>>>>> processes from QEMU with shared memory access to RAM but no direct
>>>>>>>>>> access to non-RAM MemoryRegions. The virtiofs DAX Window BAR is one
>>>>>>>>>> example of a non-RAM MemoryRegion that can be the source/target of
>>>>>>>>>> DMA requests.
>>>>>>>>>>
>>>>>>>>>> I don't think IOMMUs solve this problem but luckily the vfio-user
>>>>>>>>>> protocol already has messages that vfio-user servers can use as a
>>>>>>>>>> fallback when DMA cannot be completed through the shared memory RAM
>>>>>>>>>> accesses.
>>>>>>>>>>
>>>>>>>>>>> In fact, it seems like an IOMMU does this better in providing an IOVA
>>>>>>>>>>> address space per BDF. Is the dynamic mapping overhead too much? What
>>>>>>>>>>> physical hardware properties or specifications could we leverage to
>>>>>>>>>>> restrict p2p mappings to a device? Should it be governed by machine
>>>>>>>>>>> type to provide consistency between devices? Should each "isolated"
>>>>>>>>>>> bus be in a separate root complex? Thanks,
>>>>>>>>>>
>>>>>>>>>> There is a separate issue in this patch series regarding isolating the
>>>>>>>>>> address space where BAR accesses are made (i.e. the global
>>>>>>>>>> address_space_memory/io). When one process hosts multiple vfio-user
>>>>>>>>>> server instances (e.g. a software-defined network switch with multiple
>>>>>>>>>> ethernet devices) then each instance needs isolated memory and io address
>>>>>>>>>> spaces so that vfio-user clients don't cause collisions when they map
>>>>>>>>>> BARs to the same address.
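
The collision described above can be sketched in C. This is only a toy model, assuming an address space is nothing more than a list of non-overlapping regions. ToyRegion, ToyAddressSpace, toy_as_add() and toy_as_lookup() are hypothetical names made up for illustration; QEMU's real AddressSpace and MemoryRegion types are considerably richer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_REGIONS 4   /* arbitrary limit for the sketch */

/* One BAR-like region and the (hypothetical) device that owns it. */
typedef struct ToyRegion {
    uint64_t base;
    uint64_t size;
    int device_id;
} ToyRegion;

/* A toy address space: a flat list of non-overlapping regions. */
typedef struct ToyAddressSpace {
    ToyRegion regions[TOY_MAX_REGIONS];
    size_t count;
} ToyAddressSpace;

/* Add a region; returns false if the space is full or the range overlaps. */
static bool toy_as_add(ToyAddressSpace *as, uint64_t base, uint64_t size,
                       int device_id)
{
    assert(size > 0);
    if (as->count == TOY_MAX_REGIONS) {
        return false;
    }
    for (size_t i = 0; i < as->count; i++) {
        const ToyRegion *r = &as->regions[i];
        if (base < r->base + r->size && r->base < base + size) {
            return false;   /* two BARs cannot share one address space */
        }
    }
    as->regions[as->count++] = (ToyRegion){ base, size, device_id };
    return true;
}

/* Resolve an address to the owning device id, or -1 if nothing is mapped. */
static int toy_as_lookup(const ToyAddressSpace *as, uint64_t addr)
{
    for (size_t i = 0; i < as->count; i++) {
        const ToyRegion *r = &as->regions[i];
        if (addr >= r->base && addr - r->base < r->size) {
            return r->device_id;
        }
    }
    return -1;
}
```

With a single shared ToyAddressSpace, the second client that places its BAR at the same base is rejected. With one ToyAddressSpace per server instance, both placements succeed and each lookup resolves to that instance's own device.
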
>>>>>>>>>>
>>>>>>>>>> I think the separate root complex idea is a good solution. This
>>>>>>>>>> patch series takes a different approach by adding the concept of
>>>>>>>>>> isolated address spaces into hw/pci/.
>>>>>>>>>
>>>>>>>>> This all still seems pretty sketchy, BARs cannot overlap within the
>>>>>>>>> same vCPU address space, perhaps with the exception of when they're
>>>>>>>>> being sized, but DMA should be disabled during sizing.
>>>>>>>>>
>>>>>>>>> Devices within the same VM context with identical BARs would need to
>>>>>>>>> operate in different address spaces. For example a translation offset
>>>>>>>>> in the vCPU address space would allow unique addressing to the devices,
>>>>>>>>> perhaps using the translation offset bits to address a root complex and
>>>>>>>>> masking those bits for downstream transactions.
>>>>>>>>>
>>>>>>>>> In general, the device simply operates in an address space, ie. an
>>>>>>>>> IOVA. When a mapping is made within that address space, we perform a
>>>>>>>>> translation as necessary to generate a guest physical address. The
>>>>>>>>> IOVA itself is only meaningful within the context of the address space,
>>>>>>>>> there is no requirement or expectation for it to be globally unique.
>>>>>>>>>
>>>>>>>>> If the vfio-user server is making some sort of requirement that IOVAs
>>>>>>>>> are unique across all devices, that seems very, very wrong. Thanks,
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, BARs and IOVAs don't need to be unique across all devices.
>>>>>>>>
>>>>>>>> The issue is that there can be as many guest physical address spaces as
>>>>>>>> there are vfio-user clients connected, so per-client isolated address
>>>>>>>> spaces are required. This patch series has a solution to that problem
>>>>>>>> with the new pci_isol_as_mem/io() API.
>>>>>>>
>>>>>>> Sorry, this still doesn't follow for me. A server that hosts multiple
>>>>>>> devices across many VMs (I'm not sure if you're referring to the device
>>>>>>> or the VM as a client) needs to deal with different address spaces per
>>>>>>> device. The server needs to be able to uniquely identify every DMA,
>>>>>>> which must be part of the interface protocol. But I don't see how that
>>>>>>> imposes a requirement of an isolated address space. If we want the
>>>>>>> device isolated because we don't trust the server, that's where an IOMMU
>>>>>>> provides per device isolation. What is the restriction of the
>>>>>>> per-client isolated address space and why do we need it? The server
>>>>>>> needing to support multiple clients is not a sufficient answer to
>>>>>>> impose new PCI bus types with an implicit restriction on the VM.
>>>>>>
>>>>>> Hi Alex,
>>>>>>
>>>>>> I believe there are two separate problems with running PCI devices in
>>>>>> the vfio-user server. The first one is concerning memory isolation and
>>>>>> second one is vectoring of BAR accesses (as explained below).
>>>>>>
>>>>>> In our previous patches (v3), we used an IOMMU to isolate memory
>>>>>> spaces. But we still had trouble with the vectoring. So we implemented
>>>>>> separate address spaces for each PCIBus to tackle both problems
>>>>>> simultaneously, based on the feedback we got.
>>>>>>
>>>>>> The following gives an overview of issues concerning vectoring of
>>>>>> BAR accesses.
>>>>>>
>>>>>> The device’s BAR regions are mapped into the guest physical address
>>>>>> space. The guest writes the guest PA of each BAR into the device’s BAR
>>>>>> registers. To access the BAR regions of the device, QEMU uses
>>>>>> address_space_rw() which vectors the physical address access to the
>>>>>> device BAR region handlers.
>>>>>
>>>>> The guest physical address written to the BAR is irrelevant from the
>>>>> device perspective, this only serves to assign the BAR an offset within
>>>>> the address_space_mem, which is used by the vCPU (and possibly other
>>>>> devices depending on their address space). There is no reason for the
>>>>> device itself to care about this address.
>>>>
>>>> Thank you for the explanation, Alex!
>>>>
>>>> The confusion on my part is whether we are inside the device already when
>>>> the server receives a request to access a BAR region of a device. Based on
>>>> your explanation, I get that your view is the BAR access request has
>>>> propagated into the device already, whereas I was under the impression
>>>> that the request is still on the CPU side of the PCI root complex.
>>>
>>> If you are getting an access through your MemoryRegionOps, all the
>>> translations have been made, you simply need to use the hwaddr as the
>>> offset into the MemoryRegion for the access. Perform the read/write to
>>> your device, no further translations required.
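
As an illustration of the point above, here is a minimal sketch in C of offset-only BAR handlers. The callback signatures mirror the shape of the read/write callbacks in QEMU's MemoryRegionOps, but everything else is a simplified stand-in: the hwaddr typedef is redefined locally, and ToyDevice, TOY_BAR_SIZE, toy_bar_read() and toy_bar_write() are hypothetical names, not QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for QEMU's hwaddr type (an assumption for this sketch). */
typedef uint64_t hwaddr;

#define TOY_BAR_SIZE 16   /* hypothetical BAR size in bytes */

typedef struct ToyDevice {
    uint8_t regs[TOY_BAR_SIZE];   /* backing store for the BAR */
} ToyDevice;

/*
 * 'addr' is already an offset into the BAR. The handler never looks at the
 * guest physical address that was programmed into the BAR register.
 */
static uint64_t toy_bar_read(void *opaque, hwaddr addr, unsigned size)
{
    ToyDevice *dev = opaque;
    uint64_t val = 0;

    assert(size <= sizeof(val) && addr + size <= TOY_BAR_SIZE);
    for (unsigned i = 0; i < size; i++) {
        /* Registers are stored little-endian, independent of the host. */
        val |= (uint64_t)dev->regs[addr + i] << (8 * i);
    }
    return val;
}

static void toy_bar_write(void *opaque, hwaddr addr, uint64_t val,
                          unsigned size)
{
    ToyDevice *dev = opaque;

    assert(size <= sizeof(val) && addr + size <= TOY_BAR_SIZE);
    for (unsigned i = 0; i < size; i++) {
        dev->regs[addr + i] = (uint8_t)(val >> (8 * i));
    }
}
```

Moving the BAR elsewhere in guest physical memory would not change either handler, since both only ever see the offset.
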
>>>
>>>> Your view makes sense to me - once the BAR access request reaches the
>>>> client (on the other side), we could consider that the request has reached
>>>> the device.
>>>>
>>>> On a separate note, if devices don’t care about the values in BAR
>>>> registers, why do the default PCI config handlers intercept and map
>>>> the BAR region into address_space_mem?
>>>> (pci_default_write_config() -> pci_update_mappings())
>>>
>>> This is the part that's actually placing the BAR MemoryRegion as a
>>> sub-region into the vCPU address space. I think if you track it,
>>> you'll see PCIDevice.io_regions[i].address_space is actually
>>> system_memory, which is used to initialize address_space_system.
>>>
>>> The machine assembles PCI devices onto buses as instructed by the
>>> command line or hot plug operations. It's the responsibility of the
>>> guest firmware and guest OS to probe those devices, size the BARs, and
>>> place the BARs into the memory hierarchy of the PCI bus, ie. system
>>> memory. The BARs are necessarily in the "guest physical memory" for
>>> vCPU access, but it's essentially only coincidental that PCI devices
>>> might be in an address space that provides a mapping to their own BAR.
>>> There's no reason to ever use it.
>>>
>>> In the vIOMMU case, we can't know that the device address space
>>> includes those BAR mappings or if they do, that they're identity mapped
>>> to the physical address. Devices really need to not infer anything
>>> about an address. Think about real hardware, a device is told by
>>> driver programming to perform a DMA operation. The device doesn't know
>>> the target of that operation, it's the guest driver's responsibility to
>>> make sure the IOVA within the device address space is valid and maps to
>>> the desired target. Thanks,
>>
>> Thanks for the explanation, Alex. Thanks to everyone else in the thread who
>> helped to clarify this problem.
>>
>> We have implemented the memory isolation based on the discussion in the
>> thread. We will send the patches out shortly.
>>
>> Devices such as “name” and “e1000” worked fine. But I’d like to note that
>> the LSI device (TYPE_LSI53C895A) had some problems - it doesn’t seem
>> to be IOMMU aware. In LSI’s case, the kernel driver is asking the device to
>> read instructions from the CPU VA (lsi_execute_script() -> read_dword()),
>> which is forbidden when an IOMMU is enabled. Specifically, the driver is
>> asking the device to access other BAR regions by using the BAR address
>> programmed in the PCI config space. This happens even without vfio-user
>> patches. For example, we could enable the IOMMU using the
>> “-device intel-iommu” QEMU option and by adding the following to the kernel
>> command line: “intel_iommu=on iommu=nopt”. In this case, we could see an
>> IOMMU fault.
>
> So, device accessing its own BAR is different. Basically, these
> transactions never go on the bus at all, never mind get to the IOMMU.
Hi Michael,

In the LSI case, I did notice that it went to the IOMMU. The device is reading
the BAR address as if it were a DMA address.
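
A rough sketch in C of why such an access faults, and of what mapping the BAR into the IOVA space at the same address would change. This is a toy model only: ToyIommuDomain, toy_iommu_map() and toy_iommu_translate() are hypothetical names, the addresses used with them are made up, and nothing here reflects the actual intel-iommu emulation in QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MAPPINGS 8   /* arbitrary limit for the sketch */

/* One IOVA range and the address it translates to. */
typedef struct ToyMapping {
    uint64_t iova;
    uint64_t size;
    uint64_t target;
} ToyMapping;

/* A toy per-device IOVA space. */
typedef struct ToyIommuDomain {
    ToyMapping maps[TOY_MAX_MAPPINGS];
    size_t count;
} ToyIommuDomain;

/* Add a mapping; fails if the table is full or the IOVA range collides. */
static bool toy_iommu_map(ToyIommuDomain *d, uint64_t iova, uint64_t size,
                          uint64_t target)
{
    assert(size > 0);
    if (d->count == TOY_MAX_MAPPINGS) {
        return false;
    }
    for (size_t i = 0; i < d->count; i++) {
        const ToyMapping *m = &d->maps[i];
        if (iova < m->iova + m->size && m->iova < iova + size) {
            return false;   /* collides with an existing IOVA range */
        }
    }
    d->maps[d->count++] = (ToyMapping){ iova, size, target };
    return true;
}

/* Translate an IOVA; returning false models an IOMMU fault. */
static bool toy_iommu_translate(const ToyIommuDomain *d, uint64_t iova,
                                uint64_t *out)
{
    for (size_t i = 0; i < d->count; i++) {
        const ToyMapping *m = &d->maps[i];
        if (iova >= m->iova && iova - m->iova < m->size) {
            *out = m->target + (iova - m->iova);
            return true;
        }
    }
    return false;
}
```

In this model, a device that issues a DMA read to its own BAR address faults unless the driver mapped that address. Identity-mapping the BAR range makes the read resolve, at the cost that a later attempt to use the same IOVA range for something else is rejected.
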
> I think it's just used as a handle to address internal device memory.
> This kind of trick is not universal, but not terribly unusual.
>
>
>> Unfortunately, we started off our project with the LSI device. So that led
>> to all the confusion about what is expected at the server end in terms of
>> vectoring/address-translation. It gave an impression as if the request was
>> still on the CPU side of the PCI root complex, but the actual problem was
>> with the device driver itself.
>>
>> I’m wondering how to deal with this problem. Would it be OK if we mapped the
>> device’s BAR into the IOVA, at the same CPU VA programmed in the BAR
>> registers? This would help devices such as LSI to circumvent this problem.
>> One problem with this approach is that it has the potential to collide with
>> another legitimate IOVA address. Kindly share your thoughts on this.
>>
>> Thank you!
>
> I am not 100% sure what you plan to do but it sounds fine since even
> if it collides, with traditional PCI, devices must never initiate cycles
> within their own BAR range, and PCIe is software-compatible with PCI. So
> devices won't be able to access this IOVA even if it was programmed in
> the IOMMU.

OK sounds good, I’ll create a mapping of the device BARs in the IOVA.

Thank you!
--
Jag
>
> As was mentioned elsewhere on this thread, devices accessing each
> other's BAR is a different matter.
>
> I do not remember which rules apply to multiple functions of a
> multi-function device though. I think in traditional PCI
> they will never go out on the bus, but with e.g. SRIOV they
> would probably go out? Alex, any idea?
>
>
>> --
>> Jag
>>
>>>
>>> Alex
>>>
>>
>