Re: [Qemu-devel] QEMU PCIe link "negotiation"

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] QEMU PCIe link "negotiation"

From:	Dr. David Alan Gilbert
Subject:	Re: [Qemu-devel] QEMU PCIe link "negotiation"
Date:	Tue, 16 Oct 2018 10:49:25 +0100
User-agent:	Mutt/1.10.1 (2018-07-13)

* Alex Williamson (address@hidden) wrote:
> Hi,
> 
> I'd like to start a discussion about virtual PCIe link width and speeds
> in QEMU to figure out how we progress past the 2.5GT/s, x1 width links
> we advertise today.  This matters for assigned devices as the endpoint
> driver may not enable full physical link utilization if the upstream
> port only advertises minimal capabilities.  One GPU assignment users
> has measured that they only see an average transfer rate of 3.2GB/s
> with current code, but hacking the downstream port to advertise an
> 8GT/s, x16 width link allows them to get 12GB/s.  Obviously not all
> devices and drivers will have this dependency and see these kinds of
> improvements, or perhaps any improvement at all.
> 
> The first problem seems to be how we expose these link parameters in a
> way that makes sense and supports backwards compatibility and
> migration.  I think we want the flexibility to allow the user to
> specify per PCIe device the link width and at least the maximum link
> speed, if not the actual discrete link speeds supported.  However,
> while I want to provide this flexibility, I don't necessarily think it
> makes sense to burden the user to always specify these to get
> reasonable defaults.  So I would propose that we a) add link parameters
> to the base PCIe device class and b) set defaults based on the machine
> type.  Additionally these machine type defaults would only apply to
> generic PCIe root ports and switch ports, anything based on real
> hardware would be fixed, ex. ioh3420 would stay at 2.5GT/s, x1 unless
> overridden by the user.  Existing machine types would also stay at this
> "legacy" rate, while pc-q35-3.2 might bring all generic devices up to
> PCIe 4.0 specs, x32 width and 16GT/s, where the per-endpoint
> negotiation would bring us back to negotiated widths and speeds
> matching the endpoint.  Reasonable?
> 
> Next I think we need to look at how and when we do virtual link
> negotiation.  We're mostly discussing a virtual link, so I think
> negotiation is simply filling in the negotiated link and width with the
> highest common factor between endpoint and upstream port.  For assigned
> devices, this should match the endpoint's existing negotiated link
> parameters, however, devices can dynamically change their link speed
> (perhaps also width?), so I believe a current link seed of 2.5GT/s
> could upshift to 8GT/s without any sort of visible renegotiation.  Does
> this mean that we should have link parameter callbacks from downstream
> port to endpoint?  Or maybe the downstream port link status register
> should effectively be an alias for LNKSTA of devfn 00.0 of the
> downstream device when it exists.  We only need to report a consistent
> link status value when someone looks at it, so reading directly from
> the endpoint probably makes more sense than any sort of interface to
> keep the value current.
> 
> If we take the above approach with LNKSTA (probably also LNKSTA2), is
> any sort of "negotiation" required?  We're automatically negotiated if
> the capabilities of the upstream port are a superset of the endpoint's
> capabilities.  What do we do and what do we care about when the
> upstream port is a subset of the endpoint though?  For example, an
> 8GT/s, x16 endpoint is installed into a 2.5GT/s, x1 downstream port.
> On real hardware we obviously negotiate the endpoint down to the
> downstream port parameters.  We could do that with an emulated device,
> but this is the scenario we have today with assigned devices and we
> simply leave the inconsistency.  I don't think we actually want to
> (and there would be lots of complications to) force the physical device
> to negotiate down to match a virtual downstream port.  Do we simply
> trigger a warning that this may result in non-optimal performance and
> leave the inconsistency?
> 
> This email is already too long, but I also wonder whether we should
> consider additional vfio-pci interfaces to trigger a link retraining or
> allow virtualized access to the physical upstream port config space.
> Clearly we need to consider multi-function devices and whether there
> are useful configurations that could benefit from such access.  Thanks
> for reading, please discuss,

I'm not sure about the consequences of the actual link speeds, but I
worry we'll hit things looking for PCIe v3 features; in particular
AMD's ROCm code needs PCIe atomics:

https://github.com/RadeonOpenCompute/ROCm_Documentation/blob/master/Installation_Guide/More-about-how-ROCm-uses-PCIe-Atomics.rst

so it feels like getting that to work with passthrough would
need some negotiation of features.

Dave

> Alex
> 
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK

[Prev in Thread]

Current Thread

[Next in Thread]

[Qemu-devel] QEMU PCIe link "negotiation", Alex Williamson, 2018/10/15
- Re: [Qemu-devel] QEMU PCIe link "negotiation", Dr. David Alan Gilbert <=
  - Re: [Qemu-devel] QEMU PCIe link "negotiation", Alex Williamson, 2018/10/16
- Re: [Qemu-devel] QEMU PCIe link "negotiation", Michael S. Tsirkin, 2018/10/16
  - Re: [Qemu-devel] QEMU PCIe link "negotiation", Alex Williamson, 2018/10/16

Prev by Date: Re: [Qemu-devel] [PATCH] target/mips: Support Toshiba specific three-operand MADD and MADDU
Next by Date: [Qemu-devel] [Bug 1798057] [NEW] Not able to start instances larger than 1 TB
Previous by thread: [Qemu-devel] QEMU PCIe link "negotiation"
Next by thread: Re: [Qemu-devel] QEMU PCIe link "negotiation"
Index(es):
- Date
- Thread