Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines


From: Laszlo Ersek
Subject: Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
Date: Mon, 5 Sep 2016 18:24:48 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

On 09/01/16 15:22, Marcel Apfelbaum wrote:
> Proposes best practices on how to use PCIe/PCI device
> in PCIe based machines and explain the reasoning behind them.
> 
> Signed-off-by: Marcel Apfelbaum <address@hidden>
> ---
> 
> Hi,
> 
> Please add your comments on what to add/remove/edit to make this doc usable.

I'll give you a brain dump below -- most of it might easily be
incorrect, but I'll just speak my mind :)

> 
> Thanks,
> Marcel
> 
>  docs/pcie.txt | 145 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 145 insertions(+)
>  create mode 100644 docs/pcie.txt
> 
> diff --git a/docs/pcie.txt b/docs/pcie.txt
> new file mode 100644
> index 0000000..52a8830
> --- /dev/null
> +++ b/docs/pcie.txt
> @@ -0,0 +1,145 @@
> +PCI EXPRESS GUIDELINES
> +======================
> +
> +1. Introduction
> +================
> +The doc proposes best practices on how to use PCIe/PCI device
> +in PCIe based machines and explains the reasoning behind them.

General request: please replace all occurrences of "PCIe" with "PCI
Express" in the text (not in command lines, of course). The reason is
that the "e" is a minimal difference, and I've misread "PCIe" as "PCI"
several times while interpreting this document. Obviously the resultant
confusion is terrible, as you are explaining the difference between PCI
and PCI Express in the entire document :)

> +
> +
> +2. Device placement strategy
> +============================
> +QEMU does not have a clear socket-device matching mechanism
> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
> +Plugging a PCI device into a PCIe device might not always work and

s/PCIe device/PCI Express slot/

> +is weird anyway since it cannot be done for "bare metal".
> +Plugging a PCIe device into a PCI slot will hide the Extended
> +Configuration Space thus is also not recommended.
> +
> +The recommendation is to separate the PCIe and PCI hierarchies.
> +PCIe devices should be plugged only into PCIe Root Ports and
> +PCIe Downstream ports (let's call them PCIe ports).

Please do not use the shorthand; we should always spell out downstream
ports and root ports. Assume people reading this document are dumber
than I am wrt. PCI / PCI Express -- I'm already pretty dumb, and I
appreciate the detail! :) If they are smart, they won't mind the detail;
if they lack expertise, they'll appreciate the detail, won't they? :)

> +
> +2.1 Root Bus (pcie.0)

Can we call this Root Complex instead?

> +=====================
> +Plug only legacy PCI devices as Root Complex Integrated Devices
> +even if the PCIe spec does not forbid PCIe devices.

I suggest "even though the PCI Express spec does not forbid PCI Express
devices as Integrated Devices". (Detail is good!)

Also, as Peter suggested, this (but not just this) would be a good place
to provide command line fragments.
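
For example, something along these lines (untested; the id and addr are
just ones I made up):

  -device e1000,bus=pcie.0,addr=0x4,id=integrated-nic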

> The existing
> +hardware uses mostly PCI devices as Integrated Endpoints. In this
> +way we may avoid some strange Guest OS-es behaviour.
> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports)
> +or DMI-PCI bridges to start legacy PCI hierarchies.

Hmmmm, I had to re-read this paragraph (while looking at the diagram)
five times until I mostly understood it :) What about the following wording:

--------
Place only the following kinds of devices directly on the Root Complex:

(1) For devices with dedicated, specific functionality (network card,
graphics card, IDE controller, etc), place only legacy PCI devices on
the Root Complex. These will be considered Integrated Endpoints.
Although the PCI Express spec does not forbid PCI Express devices as
Integrated Endpoints, existing hardware mostly integrates legacy PCI
devices with the Root Complex. Guest OSes are suspected to behave
strangely when PCI Express devices are integrated with the Root Complex.

(2) PCI Express Root Ports, for starting exclusively PCI Express
hierarchies.

(3) PCI Express Switches (connected with their Upstream Ports to the
Root Complex), also for starting exclusively PCI Express hierarchies.

(4) For starting legacy PCI hierarchies: DMI-PCI bridges.

> +
> +
> +   pcie.0 bus

"bus" is correct in QEMU lingo, but I'd still call it complex here.

> +   --------------------------------------------------------------------------
> +        |                |                    |                   |
> +   -----------   ------------------   ------------------  ------------------
> +   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |  | DMI-PCI bridge |
> +   -----------   ------------------   ------------------  ------------------
> +

Please insert a separate (brief) section here about pxb-pcie devices --
just mention that they are documented in a separate spec txt in more
detail, and that they create new root complexes in practice.

In fact, maybe option (5) would be better for pxb-pcie devices, under
section 2.1, than a dedicated section!
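
For example (untested; the id / bus_nr values are arbitrary):

  -device pxb-pcie,id=pcie.1,bus_nr=4,bus=pcie.0
  -device ioh3420,id=rp_on_pxb,chassis=10,bus=pcie.1

(The separate spec txt I mean is docs/pci_expander_bridge.txt.)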

> +2.2 PCIe only hierarchy
> +=======================
> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches 
> (Upstream
> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe 
> switches
> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports.

- Please name the maximum number of root ports allowed on the root
complex (cmdline example? see the sketch right after this list for the
kind of thing I mean)

- Also, this is the first time you mention "slot". While the PCI Express
spec allows for root ports / downstream ports not implementing a slot
(IIRC), I think we shouldn't muddy the waters here, and restrict the
word "slot" to the command line examples only.

- What you say here about switches (upstream ports) matches what I've
learned from you thus far :), but it doesn't match bullet (3) in section
2.1. That is, if we suggest to *always* add a Root Port between the Root
Complex and the Upstream Port of a switch, then (3) should not be
present in section 2.1. (Do we suggest that BTW?)
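
(Regarding the cmdline example I'm asking for above: I'm thinking of
something like the multifunction trick for packing root ports, e.g.

  -device ioh3420,id=rp1,bus=pcie.0,addr=0x2.0x0,chassis=1,multifunction=on
  -device ioh3420,id=rp2,bus=pcie.0,addr=0x2.0x1,chassis=2
  -device ioh3420,id=rp3,bus=pcie.0,addr=0x2.0x2,chassis=3

-- the ids / addresses / chassis numbers are made up by me, and you know
the exact limits better than I do.)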

We're not giving a technical description here (the PCI Express spec is
good enough for that), we're dictating policy. We shouldn't be shy about
minimizing the accepted use cases.

Our main guidance here should be the amount of bus numbers used up by
the hierarchy. Parts of the document might later apply to
qemu-system-aarch64 -M virt, and that machine is severely starved in the
bus numbers department (it has MMCONFIG space for 16 buses only!)

So how about this:

* the basic idea is good I think: always go for root ports, unless the
root complex is fully populated

* if you run out of root ports, use a switch with downstream ports, but
plug the upstream port directly in the root complex (make it an
integrated device). This would save us a bus number, and match option
(3) in section 2.1, but it doesn't match the diagram below, where a root
port is between the root complex and the upstream port. (Of course, if a
root port is *required* there, then 2.1 (3) is wrong, and should be
removed.)

* the "population algorithm" should be laid out in a bit more detail.
You mention a possible depth of 6-7, but I think it would be best to
keep the hierarchy as flat as possible (let's not waste bus numbers on
upstream ports, and time on deep enumeration!). In other words, only
plug upstream ports in the root complex (and without intervening root
ports, if that's allowed). For example:

-  1-32 ports needed: use root ports only

- 33-63 ports needed: use 31 root ports, and one switch with 2-32
downstream ports

- 64-94 ports needed: use 30 root ports, one switch with 32 downstream
ports, another switch with 2-32 downstream ports

- 95-125 ports needed: use 29 root ports, two switches with 32
downstream ports each, and a third switch with 2-32 downstream ports

- 126-156 ports needed: use 28 root ports, three switches with 32
downstream ports each, and a fourth switch with 2-32 downstream ports

- 157-187 ports needed: use 27 root ports, four switches with 32
downstream ports each, and a fifth switch with 2-32 downstream ports

- 188-218 ports: 26 root ports, 5 fully populated switches, sixth switch
with 2-32 downstream ports,

- 219-249 ports: 25 root ports, 6 fully pop. switches, seventh switch
with 2-32 downstream ports

(And I think this is where it ends, because the 7 upstream ports total
in the switches take up 7 bus numbers, so we'd need 249 + 7 = 256 bus
numbers, not counting the root complex, so 249 ports isn't even attainable.)

You might argue that this is way too detailed, but with the "problem
space" offering so much freedom (consider libvirt too...), I think it
would be helpful. This would also help trim the "explorations" of
downstream QE departments :)
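
To make the intent concrete, a sketch (untested; the device names are
the ones QEMU has today, the ids and chassis numbers are arbitrary, and
whether the upstream port should sit behind a root port at all is
exactly the open question above):

  -device ioh3420,id=rp1,bus=pcie.0,chassis=1
  -device x3130-upstream,id=up1,bus=rp1
  -device xio3130-downstream,id=dp1,bus=up1,chassis=2
  -device xio3130-downstream,id=dp2,bus=up1,chassis=3
  -device virtio-net-pci,bus=dp1,disable-legacy=on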

> +
> +
> +   pcie.0 bus
> +   ----------------------------------------------------
> +        |                |               |
> +   -------------   -------------   -------------
> +   | Root Port |   | Root Port |   | Root Port |
> +   ------------   --------------   -------------
> +         |                               |
> +    ------------                 -----------------
> +    | PCIe Dev |                 | Upstream Port |
> +    ------------                 -----------------
> +                                  |            |
> +                     -------------------    -------------------
> +                     | Downstream Port |    | Downstream Port |
> +                     -------------------    -------------------
> +                             |
> +                         ------------
> +                         | PCIe Dev |
> +                         ------------
> +

So the upper right root port should be removed, probably.

Also, I recommend drawing a "container" around the upstream port plus
the two downstream ports, and tack a "switch" label to it.

> +2.3 PCI only hierarchy
> +======================
> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
> +only into pcie.0 bus.
> +
> +   pcie.0 bus
> +   ----------------------------------------------
> +        |                            |
> +   -----------               ------------------
> +   | PCI Dev |               | DMI-PCI BRIDGE |
> +   ----------                ------------------
> +                               |            |
> +                        -----------    ------------------
> +                        | PCI Dev |    | PCI-PCI Bridge |
> +                        -----------    ------------------
> +                                         |           |
> +                                  -----------     -----------
> +                                  | PCI Dev |     | PCI Dev |
> +                                  -----------     -----------

Works for me, but I would again elaborate a little bit on keeping the
hierarchy flat.

First, in order to preserve compatibility with libvirt's current
behavior, let's not plug a PCI device directly into the DMI-PCI bridge,
even if that's possible otherwise. Let's just say

- there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
is required),

- only PCI-PCI bridges should be plugged into the DMI-PCI bridge,

- let's recommend that each PCI-PCI bridge be populated until it becomes
full, at which point another PCI-PCI bridge should be plugged into the
same one DMI-PCI bridge. Theoretically, with 32 legacy PCI devices per
PCI-PCI bridge, and 32 PCI-PCI bridges stuffed into the one DMI-PCI
bridge, we could have ~1024 legacy PCI devices (not counting the
integrated ones on the root complex(es)). There's also multi-function,
so I can't see anyone needing more than this.

For practical reasons though (see later), we should state here that we
recommend no more than 9 (nine) PCI-PCI bridges in total, all located
directly under the 1 (one) DMI-PCI bridge that is integrated into the
pcie.0 root complex. Nine PCI-PCI bridges should allow for 288 legacy
PCI devices. (And then there's multifunction.)
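
In command line terms I'm thinking of something like this (untested;
ids and chassis_nr values invented by me):

  -device i82801b11-bridge,id=dmi_pci_bridge,bus=pcie.0
  -device pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge,chassis_nr=1
  -device pci-bridge,id=pci_bridge2,bus=dmi_pci_bridge,chassis_nr=2
  -device e1000,bus=pci_bridge1,addr=0x1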

> +
> +
> +
> +3. IO space issues
> +===================
> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and

(please spell out downstream + root port)

> +as required by PCI spec will reserve a 4K IO range for each.
> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
> +it by allocation the IO space only if there is at least a device
> +with IO BARs plugged into the bridge.

This used to be true, but is no longer true, for OVMF. And I think it's
actually correct: we *should* keep the 4K IO reservation per PCI-PCI bridge.

(But, certainly no IO reservation for PCI Express root port, upstream
port, or downstream port! And I'll need your help for telling these
apart in OVMF.)

Let me elaborate more under section "4. Hot Plug". For now let me just
say that I'd like this language about optimization to be dropped.

> +Behind a PCIe PORT only one device may be plugged, resulting in

(again, please spell out downstream and root port)

> +the allocation of a whole 4K range for each device.
> +The IO space is limited resulting in ~10 PCIe ports per system

(limited to 65536 byte-wide IO ports, but it's fragmented, so we have
about 10 * 4K free)

> +if devices with IO BARs are plugged into IO ports.

not "into IO ports" but "into PCI Express downstream and root ports".

> +
> +Using the proposed device placing strategy solves this issue
> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires

(please spell out root / downstream etc)

> +PCIe devices to work without IO BARs.
> +The PCI hierarchy has no such limitations.

I'm sorry to have fragmented this section with so many comments, but the
idea is actually splendid, in my opinion!


... Okay, still speaking of resources, could you insert a brief section
here about bus numbers? Under "3. IO space issues", you already explain
how "practically everything" qualifies as a PCI bridge. We should
mention that all those things, such as:
- root complex pcie.0,
- root complex added by pxb-pcie,
- root ports,
- upstream ports, downstream ports,
- bridges, etc

take up bus numbers, and we have 256 bus numbers in total.

In the next section you state that PCI hotplug (ACPI based) and PCI
Express hotplug (native) can work side by side, which is correct. The
IO space competition is eliminated by the scheme proposed in section 3,
and the MMIO space competition is "obvious", but the bus number
starvation is *very much* non-obvious. It should be spelled out. I think
it deserves a separate section. (Again, with an eye toward
qemu-system-aarch64 -M virt -- we've seen PCI Express failures there,
and they were due to bus number starvation. It wasn't fun to debug.
(Well, it was, but don't tell anyone :)))

> +
> +
> +4. Hot Plug
> +============
> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,

s/root bus/root complex/? Also, any root complexes added with pxb-pcie
don't support hotplug.

> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.

I would say: ... so anything that plugs *only* into a root complex,
cannot be hotplugged. Then the list is what you mention here (also
referring back to options (1), (2) and (4) in section 2.1), *plus* I
would also add option (5): pxb-pcie can also not be hotplugged.

> +
> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
> +in QEMU preventing it to work, but it would be solved soon).

What bug?

Anyway, I'm unsure we should add this remark here -- it's a guide, not a
status report. I'm worried that whenever we fix that bug, we'll forget
to remove this remark.

> +The PCI hotplug is ACPI based and can work side by side with the PCIe
> +native hotplug.
> +
> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.

I would mention the order (upstream port, downstream port), also add
some command lines maybe.
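
For example, with a root port "rp1" left empty at startup, something
like this in the monitor (ids made up by me):

  (qemu) device_add virtio-net-pci,bus=rp1,id=hp_nic,disable-legacy=on

and the reverse:

  (qemu) device_del hp_nic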

> +Keep in mind you always need to have at least one PCIe Port available
> +for hotplug, the PCIe Ports themselves are not hot-pluggable.

Well, the downstream ports of a switch that is being added *are*, aren't
they?

But, this question is actually irrelevant IMO, because here I would add
another subsection about *planning* for hot-plug. (I think that's pretty
important.) And those plans should make the hotplugging of switches
unnecessary!

* For the PCI Express hierarchy, I recommended a flat structure above.
The 256 bus numbers can easily be exhausted / covered by 25 root ports
plus 7 switches (each switch being fully populated with downstream
ports). This should allow all sysadmins to estimate their expected
numbers of hotplug PCI Express devices, in advance, and create enough
root ports / downstream ports. (Sysadmins are already used to planning
for hotplug, see VCPUs, memory (DIMM), memory (balloon).)

* For the PCI hierarchy, it should be even simpler, but worth a mention
-- start with enough PCI-PCI bridges under the one DMI-PCI bridge.

* Finally, this is the spot where we should design and explain our
resource reservation for hotplug:

  - For PCI Express hotplug, please explain that only such PCI Express
    devices can be hotplugged that require no IO space -- in section
    "3. IO space issues" you mention that this is a valid restriction.
    Furthermore, please state the MMIO32 and/or MMIO64 sizes that the
    firmware needs to reserve for root ports / downstream ports, and
    also explain that these sizes will act as maximum size and
    alignment limits for *individual* hotplug devices.

    We can invent fw_cfg switches for this, or maybe even a special PCI
    Express capability (to be placed in the config space of root and
    downstream ports).

  - For legacy PCI hotplug, this is where my evil plan of "no more than
    9 (nine) PCI-PCI bridges under the 1 (one) DMI-PCI bridge" unfolds!

    We've stated above (in Section 3) that we have about 10*4KB IO port
    space. One of those 4K chunks will go (collectively) to the
    Integrated PCI devices that sit on pcie.0. (If there are other root
    complexes from pxb-pcie devices, then those will get one chunk each
    too.) The rest -- hence assume 9 or fewer chunks -- will be consumed
    by the 9 (or respectively fewer) PCI-PCI bridges, for hotplug
    reservation. The upshot is that as long as a sysadmin sticks with
    our flat, "9 PCI-PCI bridges total" recommendation for the legacy
    PCI hierarchy, the IO reservation will be covered *immediately*.
    Simply put: don't create more PCI-PCI bridges than you have IO
    space for -- and that should leave you with about 9 "sibling"
    bridges, which are plenty enough for a huge number of legacy PCI
    devices!

    Furthermore, please state the MMIO32 / MMIO64 amount to reserve
    *per PCI-PCI bridge*. The firmware programmers need to know this,
    and people planning for legacy PCI hotplug should be informed that
    those limits are for all devices *together* on the same PCI-PCI
    bridge.

    Again, we could expose this via fw_cfg, or in a special capability
    (as suggested by Gerd IIRC) in the PCI config space of the PCI-PCI
    bridge.

> +
> +
> +5. Device assignment
> +====================
> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
> +PCI-PCI bridge slots can be used for legacy PCI host devices.

Please provide a command line (lspci) so that users can easily determine
if the device they wish to assign is legacy PCI or PCI Express.
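
Something like this should do, I think (the BDF is just an example):

  lspci -s 01:00.0 -vv | grep -i express

If an "Express (...) Endpoint" capability shows up, the device is PCI
Express; if nothing does, treat it as legacy PCI.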

> +
> +
> +6. Virtio devices
> +=================
> +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices

(drop "an")

> +will remain PCI and have transitional behaviour as default.

(Please add one sentence about what "transitional" means in this context
-- they'll have both IO and MMIO BARs.)

> +Virtio devices plugged into PCIe ports are Express devices and have
> +"1.0" behavior by default without IO support.
> +In both case disable-* properties can be used to override the behaviour.

Please emphasize that setting disable-legacy=off (that is, enabling
legacy behavior) for PCI Express virtio devices will cause them to
require IO space, which, given our PCI Express hierarchy, may quickly
lead to resource exhaustion, and is therefore strongly discouraged.
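
In other words, the recommended forms would look something like (ids
made up by me):

  # Express, modern-only, no IO BARs -- fine behind a root port
  -device virtio-net-pci,bus=rp1,disable-legacy=on,disable-modern=off

  # transitional (IO + MMIO BARs) -- keep it on the legacy PCI side
  -device virtio-net-pci,bus=pci_bridge1

with disable-legacy=off behind a root port being the combination to
warn against.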

> +
> +
> +7. Conclusion
> +==============
> +The proposal offers a usage model that is easy to understand and follow
> +and in the same time overcomes some PCIe limitations.

I agree!

Thanks!
Laszlo


