From: David Gibson
Subject: Re: [Qemu-ppc] [RFC 1/4] spapr_pci: Delegate placement of PCI host bridges to machine type
Date: Fri, 7 Oct 2016 20:17:54 +1100
User-agent: Mutt/1.7.0 (2016-08-17)

On Fri, Oct 07, 2016 at 04:34:59PM +1100, Alexey Kardashevskiy wrote:
> On 07/10/16 16:10, David Gibson wrote:
> > On Fri, Oct 07, 2016 at 02:57:43PM +1100, Alexey Kardashevskiy wrote:
> >> On 06/10/16 14:03, David Gibson wrote:
> >>> The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB)
> >>> for a PAPR guest.  Unlike on x86, it's routine on Power (both bare metal
> >>> and PAPR guests) to have numerous independent PHBs, each controlling a
> >>> separate PCI domain.
> >>>
> >>> There are two ways of configuring the spapr-pci-host-bridge device: first
> >>> it can be done fully manually, specifying the locations and sizes of all
> >>> the IO windows.  This gives the most control, but is very awkward with 6
> >>> mandatory parameters.  Alternatively just an "index" can be specified
> >>> which essentially selects from an array of predefined PHB locations.
> >>> The PHB at index 0 is automatically created as the default PHB.
> >>>
> >>> The current set of default locations causes some problems for guests with
> >>> large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia
> >>> GPGPU cards via VFIO).  Obviously, for migration we can only change the
> >>> locations on a new machine type, however.
> >>>
> >>> This is awkward, because the placement is currently decided within the
> >>> spapr-pci-host-bridge code, so it breaks abstraction to look inside the
> >>> machine type version.
> >>>
> >>> So, this patch delegates the "default mode" PHB placement from the
> >>> spapr-pci-host-bridge device back to the machine type via a public method
> >>> in sPAPRMachineClass.  It's still a bit ugly, but it's about the best we
> >>> can do.
> >>>
> >>> For now, this just changes where the calculation is done.  It doesn't
> >>> change the actual location of the host bridges, or any other behaviour.
> >>>
> >>> Signed-off-by: David Gibson <address@hidden>
> >>> ---
> >>>  hw/ppc/spapr.c              | 34 ++++++++++++++++++++++++++++++++++
> >>>  hw/ppc/spapr_pci.c          | 22 ++++++++--------------
> >>>  include/hw/pci-host/spapr.h | 11 +----------
> >>>  include/hw/ppc/spapr.h      |  4 ++++
> >>>  4 files changed, 47 insertions(+), 24 deletions(-)
> >>>
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index 03e3803..f6e9c2a 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -2370,6 +2370,39 @@ static HotpluggableCPUList *spapr_query_hotpluggable_cpus(MachineState *machine)
> >>>      return head;
> >>>  }
> >>>  
> >>> +static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
> >>> +                                uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> >>> +                                hwaddr *mmio, hwaddr *mmio_size,
> >>> +                                unsigned n_dma, uint32_t *liobns, Error **errp)
> >>> +{
> >>> +    const uint64_t base_buid = 0x800000020000000ULL;
> >>> +    const hwaddr phb0_base = 0x10000000000ULL; /* 1 TiB */
> >>> +    const hwaddr phb_spacing = 0x1000000000ULL; /* 64 GiB */
> >>> +    const hwaddr mmio_offset = 0xa0000000; /* 2 GiB + 512 MiB */
> >>> +    const hwaddr pio_offset = 0x80000000; /* 2 GiB */
> >>> +    const uint32_t max_index = 255;
> >>> +
> >>> +    hwaddr phb_base;
> >>> +    int i;
> >>> +
> >>> +    if (index > max_index) {
> >>> +        error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> >>> +                   max_index);
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    *buid = base_buid + index;
> >>> +    for (i = 0; i < n_dma; ++i) {
> >>> +        liobns[i] = SPAPR_PCI_LIOBN(index, i);
> >>> +    }
> >>> +
> >>> +    phb_base = phb0_base + index * phb_spacing;
> >>> +    *pio = phb_base + pio_offset;
> >>> +    *pio_size = SPAPR_PCI_IO_WIN_SIZE;
> >>> +    *mmio = phb_base + mmio_offset;
> >>> +    *mmio_size = SPAPR_PCI_MMIO_WIN_SIZE;
> >>> +}
> >>> +
> >>>  static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >>>  {
> >>>      MachineClass *mc = MACHINE_CLASS(oc);
> >>> @@ -2406,6 +2439,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >>>      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> >>>      fwc->get_dev_path = spapr_get_fw_dev_path;
> >>>      nc->nmi_monitor_handler = spapr_nmi;
> >>> +    smc->phb_placement = spapr_phb_placement;
> >>>  }
> >>>  
> >>>  static const TypeInfo spapr_machine_info = {
> >>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>> index 4f00865..c0fc964 100644
> >>> --- a/hw/ppc/spapr_pci.c
> >>> +++ b/hw/ppc/spapr_pci.c
> >>> @@ -1311,7 +1311,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>          sphb->ddw_enabled ? SPAPR_PCI_DMA_MAX_WINDOWS : 1;
> >>>  
> >>>      if (sphb->index != (uint32_t)-1) {
> >>> -        hwaddr windows_base;
> >>> +        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> >>> +        Error *local_err = NULL;
> >>>  
> >>>          if ((sphb->buid != (uint64_t)-1) || (sphb->dma_liobn[0] != (uint32_t)-1)
> >>>              || (sphb->dma_liobn[1] != (uint32_t)-1 && windows_supported == 2)
> >>> @@ -1322,21 +1323,14 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
> >>>              return;
> >>>          }
> >>>  
> >>> -        if (sphb->index > SPAPR_PCI_MAX_INDEX) {
> >>> -            error_setg(errp, "\"index\" for PAPR PHB is too large (max %u)",
> >>> -                       SPAPR_PCI_MAX_INDEX);
> >>> +        smc->phb_placement(spapr, sphb->index,
> >>> +                           &sphb->buid, &sphb->io_win_addr, &sphb->io_win_size,
> >>> +                           &sphb->mem_win_addr, &sphb->mem_win_size,
> >>> +                           windows_supported, sphb->dma_liobn, &local_err);
> >>> +        if (local_err) {
> >>> +            error_propagate(errp, local_err);
> >>>              return;
> >>>          }
> >>> -
> >>> -        sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
> >>> -        for (i = 0; i < windows_supported; ++i) {
> >>> -            sphb->dma_liobn[i] = SPAPR_PCI_LIOBN(sphb->index, i);
> >>> -        }
> >>> -
> >>> -        windows_base = SPAPR_PCI_WINDOW_BASE
> >>> -            + sphb->index * SPAPR_PCI_WINDOW_SPACING;
> >>> -        sphb->mem_win_addr = windows_base + SPAPR_PCI_MMIO_WIN_OFF;
> >>> -        sphb->io_win_addr = windows_base + SPAPR_PCI_IO_WIN_OFF;
> >>>      }
> >>>  
> >>>      if (sphb->buid == (uint64_t)-1) {
> >>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> >>> index 30dbd46..8c9ebfd 100644
> >>> --- a/include/hw/pci-host/spapr.h
> >>> +++ b/include/hw/pci-host/spapr.h
> >>> @@ -79,18 +79,9 @@ struct sPAPRPHBState {
> >>>      uint32_t numa_node;
> >>>  };
> >>>  
> >>> -#define SPAPR_PCI_MAX_INDEX          255
> >>> -
> >>> -#define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
> >>> -
> >>>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> >>>  
> >>> -#define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> >>> -#define SPAPR_PCI_WINDOW_SPACING     0x1000000000ULL
> >>> -#define SPAPR_PCI_MMIO_WIN_OFF       0xA0000000
> >>> -#define SPAPR_PCI_MMIO_WIN_SIZE      (SPAPR_PCI_WINDOW_SPACING - \
> >>> -                                     SPAPR_PCI_MEM_WIN_BUS_OFFSET)
> >>> -#define SPAPR_PCI_IO_WIN_OFF         0x80000000
> >>> +#define SPAPR_PCI_MMIO_WIN_SIZE      0xf80000000
> >>>  #define SPAPR_PCI_IO_WIN_SIZE        0x10000
> >>>  
> >>>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
> >>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>> index 39dadaa..9e1c5a5 100644
> >>> --- a/include/hw/ppc/spapr.h
> >>> +++ b/include/hw/ppc/spapr.h
> >>> @@ -40,6 +40,10 @@ struct sPAPRMachineClass {
> >>>      bool dr_lmb_enabled;       /* enable dynamic-reconfig/hotplug of LMBs */
> >>>      bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
> >>>      const char *tcg_default_cpu; /* which (TCG) CPU to simulate by default */
> >>> +    void (*phb_placement)(sPAPRMachineState *spapr, uint32_t index,
> >>> +                          uint64_t *buid, hwaddr *pio, hwaddr *pio_size,
> >>> +                          hwaddr *mmio, hwaddr *mmio_size,
> >>> +                          unsigned n_dma, uint32_t *liobns, Error **errp);
> >>
> >>
> >> I still like the idea of getting rid of index better. In order to reduce
> >> the command line length, you could leave IO and MMIO32 disabled by
> >> default, the default size for MMIO64 could be set to 1 TiB or something
> >> like that, and BUID could be just the raw MMIO64 address value.
> > 
> > Hm, interesting idea.  I do like the idea of making the BUID equal to
> > the MMIO64 address.  Obviously we'd still need addresses for the
> > default PHB, including 32-bit and PIO windows.
> > 
> > Trickier, we'd still need to support 'index' for compatibility with
> > older command lines.  I guess we could reserve index purely for the
> > "legacy" locations - 2.8+ machine types would create the default
> > bridge with explicit windows, rather than just using index 0.
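
(For illustration only: a hypothetical sketch of what "create the default
bridge with explicit windows" could look like, reusing the legacy index-0
layout from the patch above.  The helper name is made up, it assumes the
usual hw/ppc/spapr.c context, and the property names follow spapr_pci.c of
that era as an assumption rather than anything settled in this thread.)

    /* Hypothetical: default PHB with explicit windows instead of "index" */
    static void spapr_create_default_phb_explicit(void)
    {
        DeviceState *dev = qdev_create(NULL, "spapr-pci-host-bridge");

        qdev_prop_set_uint64(dev, "buid", 0x800000020000000ULL);
        qdev_prop_set_uint64(dev, "mem_win_addr", 0x100a0000000ULL); /* 1 TiB + 2.5 GiB */
        qdev_prop_set_uint64(dev, "mem_win_size", 0xf80000000ULL);   /* 62 GiB */
        qdev_prop_set_uint64(dev, "io_win_addr", 0x10080000000ULL);  /* 1 TiB + 2 GiB */
        qdev_prop_set_uint64(dev, "io_win_size", 0x10000);
        qdev_prop_set_uint32(dev, "liobn", SPAPR_PCI_LIOBN(0, 0));
        /* with DDW a second DMA window LIOBN would need setting here too */
        qdev_init_nofail(dev);
    }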
> > 
> >> So essentially you would only have to specify MMIO64 and 2xLIOBN on the
> >> command line, which does not seem like much of an overkill, assuming that
> >> a second (third, ...) PHB is almost never used, and even if it is, it
> >> will be used for SR-IOV VFs (which can do 64-bit) or virtio (the same).
> > 
> > Hm, maybe.  3 values is still a bit of a PITA.  I guess liobn64 could
> > theoretically be optional, although you're going to want it for
> > basically every case where you're creating an additional PHB.
> > 
> > Or.. what if we insisted the mmio64 window had at least 4 GiB alignment?
> > Then we might be able to use mmio64 >> 32 (== buid >> 32) as the liobn.
> > Hmm.. I guess it would need to be 8 GiB alignment, so we have one extra
> > bit.
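
(Also for illustration only: a minimal, self-contained C sketch of the
derivation floated here, i.e. BUID == the MMIO64 base and the LIOBNs taken
from its upper bits plus the window number, assuming 8 GiB alignment.  The
helper name and alignment macro are hypothetical, not from the patch or
this thread.)

    #include <assert.h>
    #include <stdint.h>

    #define PHB_MMIO64_ALIGN 0x200000000ULL             /* 8 GiB */

    /* 8 GiB alignment keeps bit 0 of (mmio64 >> 32) clear, leaving room
     * for the DMA window number (0 or 1). */
    static void phb_derive_ids(uint64_t mmio64, unsigned n_dma,
                               uint64_t *buid, uint32_t *liobns)
    {
        unsigned i;

        assert((mmio64 & (PHB_MMIO64_ALIGN - 1)) == 0);
        assert(n_dma <= 2);

        *buid = mmio64;                          /* BUID == window base */
        for (i = 0; i < n_dma; i++) {
            liobns[i] = (uint32_t)(mmio64 >> 32) | i;
        }
    }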
> 
> 
> As you mentioned in 4/4, guests cannot access above 64 TiB, so we could use
> the upper bits to store the window number.

*Some* guests can't access above 64 TiB; that's not an inherent
limitation.  So I'd rather not do that.

> 
> 
> > 
> > ..and I wonder if this might be easier to manage if we introduced a
> > new spapr-pci-host-bridge-2 subtype.
> > 
> > Let me think about this..
> > 
> >> With index and phb_placement(), QEMU will still auto-generate these
> >> IO/MMIO addresses from the command line parameters.  That did hit us with
> >> migration before, when the QEMU command lines on the source and
> >> destination sides were not exactly the same, and it might hit us again...
> > 
> > Uh.. I'm not quite following you.  What's the situation which would
> > break migration with the proposed scheme?
> 
> Well, I cannot think of any other device available on SPAPR which we could
> map to the CPU address space but if it ever appears, we will have a problem
> unless this new imaginary device does not pick an address from the CPU
> space automatically.

I followed that even less than the previous one; I can't even parse it.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
