From: Matthew Rosato
Subject: Re: [PATCH 12/12] s390x/pci: let intercept devices have separate PCI groups
Date: Thu, 16 Dec 2021 10:16:10 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.3.0

On 12/16/21 3:15 AM, Pierre Morel wrote:


On 12/7/21 22:04, Matthew Rosato wrote:
Let's use the reserved pool of simulated PCI groups to allow intercept
devices to have separate groups from interpreted devices as some group
values may be different. If we run out of simulated PCI groups, subsequent
intercept devices just get the default group.
Furthermore, if we encounter any PCI groups from hostdevs that are marked
as simulated, let's just assign them to the default group to avoid
conflicts between host simulated groups and our own simulated groups.

I have a problem here.
We will have the same hardware viewed by 2 different VFIO implementations (interpretation vs interception) reporting different group IDs.

Yes -- To be clear, this patch proposes that the interpreted device will continue to report the passthrough group ID and the intercept device will use a simulated group ID.


The alternative is to have them reporting the same group ID with different values.


I don't think we can do this. For starters, we would have to throw out the group tracking we do in QEMU; but for all we know the guest could be doing similar tracking -- the implication of the group ID is that everyone shares the same values, so I don't think we can get away with reporting different values for 2 members of the same group.

I think the other alternative is rather to always do something like...

1) host reports its value via vfio capabilities as 'this is what an interpreted device can use'
2) QEMU must accept those values as-is OR reduce them to some subset of what both interpretation and intercept can support, and report only those values for all devices in the group. (More on this further down)


I fear both are wrong.

On the other hand, should we have a difference in the QEMU command line between intercepted and interpreted devices for default values?

I'm not sure I follow what you suggest here. Even if we somehow provided a command-line means for specifying some of these values, they would still be presented to the guest via clp and if the guest has 2 devices in the same group the clp results had better be the same.

If not, why not give up the host values so that in a hypothetical future migration we are clean with the GID?


Well, the interpreted device will use the passthrough group ID so in a hypothetical future migration scenario we should be good there.

And simulated devices will still use the default group, so we should also be OK there.

This really changes the behavior for 2 other classes of device:

1) Intercept passthrough devices -- Yes, I agree that doing this is a bit weird. But my thinking was that these devices should be the exception case rather than the norm moving forward, and it would clearly delineate the difference in Q PCI FNGRP values.

2) nested simulated devices -- These aren't using real GIDs anyway and I would expect them to also be using the default group already -- forcing these to the default group was basically to make sure they didn't conflict with the simulated groups being created for intercept devices above.

I am not sure of this, I just want to open a little discussion on it.

FWIW, I'm not 100% on this either, so a better idea is welcome. One thing I don't like, for example, is that we only have 16 simulated groups to work with; we might find it useful later to split simulated devices into different groups based on type.


For example, what could go wrong if we keep the host values returned by the CAP?

As-is, we risk advertising the wrong maxstbl and dtsm value for some devices in the group, depending on which device is plugged first. Imagine you have 2 devices on group 5; one will be interpreted and the other intercepted.

If the interpreted device plugs first, we will use the passthrough maxstbl and dtsm for all devices in the group; so the intercept device gets these values too.

If the intercept device plugs first, we will use the QEMU value for DTSM and the smaller maxstbl required for intercept passthrough. So the interpreted device gets these values too.

Worth noting, we could have more of these differences later -- but if we want to avoid splitting the group, then I think we have to circle back to my 'alternative idea' above and provide equivalent support or toleration for intercept devices so that we can report a single group value that both types can support.

So insofar as dealing with the differences today... maxstbl is pretty easy: we can just tolerate supporting the larger maxstbl in QEMU by adding logic to break up the I/O in pcistb_service_call. We might have to provide 2 different maxstbl values over vfio capabilities, however (what interpretation can support vs what the kernel API supports for intercept, as this could change between host kernel versions).
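
To make that concrete, something along these lines (purely a sketch, not a proposal for the final code; zpci_write_block_chunk() and max_intercept_stbl are made-up names, not existing QEMU interfaces):

/*
 * Rough sketch only -- split one guest PCISTB of 'len' bytes into pieces
 * no larger than the maxstbl the intercept path can handle.
 */
static int zpci_write_block_split(const uint8_t *src, uint64_t offset,
                                  uint64_t len, uint64_t max_intercept_stbl)
{
    uint64_t done = 0;

    while (done < len) {
        uint64_t chunk = MIN(len - done, max_intercept_stbl);
        int rc = zpci_write_block_chunk(src + done, offset + done, chunk);

        if (rc) {
            return rc;      /* surface the first failing chunk */
        }
        done += chunk;
    }
    return 0;
}

The chunking itself is trivial; the real question is which maxstbl limit we end up advertising to the guest, per the vfio capability point above.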

DTSM is a little trickier. We are actually OK today because both intercept and interpreted devices will report the same value anyway, but that could change in the future. Maybe here QEMU must report

dtsm = (QEMU_SUPPORT_MASK & HOST_SUPPORT_MASK);

So basically: ensure that only what both QEMU intercept and passthrough supports is advertised via the clp. If we want to support a new type later, then we must either support it in both kvm and QEMU to enable it for the guest (or disallow intercept devices on that group, or provide some means of forcing an intercept device to the default group, etc)
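
In s390_pci_read_group() terms that would look something like the following sketch (assuming the host exposes its dtsm via the zPCI group vfio capability and that QEMU keeps a mask of what its intercept path handles; QEMU_SUPPORTED_DTSM is a placeholder name):

/* sketch only: QEMU_SUPPORTED_DTSM is a placeholder for a QEMU-side mask */
resgrp->dtsm = cap->dtsm & QEMU_SUPPORTED_DTSM;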

If we do the above, then I think we can drop the idea of using simulated groups for intercept passthrough devices. What do you think?




Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
  hw/s390x/s390-pci-bus.c         | 19 ++++++++++++++--
  hw/s390x/s390-pci-vfio.c        | 40 ++++++++++++++++++++++++++++++---
  include/hw/s390x/s390-pci-bus.h |  6 ++++-
  3 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index ab442f17fb..8b0f3ef120 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -747,13 +747,14 @@ static void s390_pci_iommu_free(S390pciState *s, PCIBus *bus, int32_t devfn)
      object_unref(OBJECT(iommu));
  }
-S390PCIGroup *s390_group_create(int id)
+S390PCIGroup *s390_group_create(int id, int host_id)
  {
      S390PCIGroup *group;
      S390pciState *s = s390_get_phb();
      group = g_new0(S390PCIGroup, 1);
      group->id = id;
+    group->host_id = host_id;
      QTAILQ_INSERT_TAIL(&s->zpci_groups, group, link);
      return group;
  }
@@ -771,12 +772,25 @@ S390PCIGroup *s390_group_find(int id)
      return NULL;
  }
+S390PCIGroup *s390_group_find_host_sim(int host_id)
+{
+    S390PCIGroup *group;
+    S390pciState *s = s390_get_phb();
+
+    QTAILQ_FOREACH(group, &s->zpci_groups, link) {
+        if (group->id >= ZPCI_SIM_GRP_START && group->host_id == host_id) {
+            return group;
+        }
+    }
+    return NULL;
+}
+
  static void s390_pci_init_default_group(void)
  {
      S390PCIGroup *group;
      ClpRspQueryPciGrp *resgrp;
-    group = s390_group_create(ZPCI_DEFAULT_FN_GRP);
+    group = s390_group_create(ZPCI_DEFAULT_FN_GRP, ZPCI_DEFAULT_FN_GRP);
      resgrp = &group->zpci_group;
      resgrp->fr = 1;
      resgrp->dasm = 0;
@@ -824,6 +838,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
                                             NULL, g_free);
      s->zpci_table = g_hash_table_new_full(g_int_hash, g_int_equal, NULL, NULL);
      s->bus_no = 0;
+    s->next_sim_grp = ZPCI_SIM_GRP_START;
      QTAILQ_INIT(&s->pending_sei);
      QTAILQ_INIT(&s->zpci_devs);
      QTAILQ_INIT(&s->zpci_dma_limit);
diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index c9269683f5..bdc5892287 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -305,13 +305,17 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
  {
      struct vfio_info_cap_header *hdr;
      struct vfio_device_info_cap_zpci_group *cap;
+    S390pciState *s = s390_get_phb();
      ClpRspQueryPciGrp *resgrp;
      VFIOPCIDevice *vpci =  container_of(pbdev->pdev, VFIOPCIDevice, pdev);

      hdr = vfio_get_device_info_cap(info, VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
-    /* If capability not provided, just use the default group */
-    if (hdr == NULL) {
+    /*
+     * If capability not provided or the underlying hostdev is simulated, just
+     * use the default group.
+     */
+    if (hdr == NULL || pbdev->zpci_fn.pfgid >= ZPCI_SIM_GRP_START) {
          trace_s390_pci_clp_cap(vpci->vbasedev.name,
                                 VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
          pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
@@ -320,11 +324,41 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
      }
      cap = (void *) hdr;
+    /*
+     * For an intercept device, let's use an existing simulated group if
+     * one was already created for other intercept devices in this group.
+     * If not, create a new simulated group if any are still available.
+     * If all else fails, just fall back on the default group.
+     */
+    if (!pbdev->interp) {
+        pbdev->pci_group = s390_group_find_host_sim(pbdev->zpci_fn.pfgid);
+        if (pbdev->pci_group) {
+            /* Use existing simulated group */
+            pbdev->zpci_fn.pfgid = pbdev->pci_group->id;
+            return;
+        } else {
+            if (s->next_sim_grp == ZPCI_DEFAULT_FN_GRP) {
+                /* All out of simulated groups, use default */
+                trace_s390_pci_clp_cap(vpci->vbasedev.name,
+                                       VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
+                pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
+                pbdev->pci_group = s390_group_find(ZPCI_DEFAULT_FN_GRP);
+                return;
+            } else {
+                /* We can assign a new simulated group */
+                pbdev->zpci_fn.pfgid = s->next_sim_grp;
+                s->next_sim_grp++;
+                /* Fall through to create the new sim group using CLP info */
+            }
+        }
+    }
+
      /* See if the PCI group is already defined, create if not */
      pbdev->pci_group = s390_group_find(pbdev->zpci_fn.pfgid);
      if (!pbdev->pci_group) {
-        pbdev->pci_group = s390_group_create(pbdev->zpci_fn.pfgid);
+        pbdev->pci_group = s390_group_create(pbdev->zpci_fn.pfgid,
+                                             pbdev->zpci_fn.pfgid);
          resgrp = &pbdev->pci_group->zpci_group;
          if (cap->flags & VFIO_DEVICE_INFO_ZPCI_FLAG_REFRESH) {
diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h
index 9941ca0084..8664023d5d 100644
--- a/include/hw/s390x/s390-pci-bus.h
+++ b/include/hw/s390x/s390-pci-bus.h
@@ -315,13 +315,16 @@ typedef struct ZpciFmb {
  QEMU_BUILD_BUG_MSG(offsetof(ZpciFmb, fmt0) != 48, "padding in ZpciFmb");
  #define ZPCI_DEFAULT_FN_GRP 0xFF
+#define ZPCI_SIM_GRP_START 0xF0
  typedef struct S390PCIGroup {
      ClpRspQueryPciGrp zpci_group;
      int id;
+    int host_id;
      QTAILQ_ENTRY(S390PCIGroup) link;
  } S390PCIGroup;
-S390PCIGroup *s390_group_create(int id);
+S390PCIGroup *s390_group_create(int id, int host_id);
  S390PCIGroup *s390_group_find(int id);
+S390PCIGroup *s390_group_find_host_sim(int host_id);
  struct S390PCIBusDevice {
      DeviceState qdev;
@@ -370,6 +373,7 @@ struct S390pciState {
      QTAILQ_HEAD(, S390PCIBusDevice) zpci_devs;
      QTAILQ_HEAD(, S390PCIDMACount) zpci_dma_limit;
      QTAILQ_HEAD(, S390PCIGroup) zpci_groups;
+    uint8_t next_sim_grp;
  };
  S390pciState *s390_get_phb(void);





