qemu-devel
Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines


From: Marcel Apfelbaum
Subject: Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
Date: Tue, 6 Sep 2016 17:46:42 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.1.1

On 09/06/2016 04:31 PM, Laszlo Ersek wrote:
On 09/05/16 22:02, Marcel Apfelbaum wrote:
On 09/05/2016 07:24 PM, Laszlo Ersek wrote:
On 09/01/16 15:22, Marcel Apfelbaum wrote:
Propose best practices on how to use PCIe/PCI devices
in PCIe based machines and explain the reasoning behind them.

Signed-off-by: Marcel Apfelbaum <address@hidden>
---

Hi,

Please add your comments on what to add/remove/edit to make this doc
usable.



[...]


(But, certainly no IO reservation for PCI Express root ports, upstream
ports, or downstream ports! And I'll need your help telling these
apart in OVMF.)


Just let me know how I can help.

Well, in the EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
implementation, I'll have to look at the PCI config space of the
"bridge-like" PCI device that the generic PCI Bus driver of edk2 passes
back to me, asking me about resource reservation.

Based on the config space, I should be able to tell apart "PCI-PCI
bridge" from "PCI Express downstream or root port". So what I'd need
here is a semi-formal natural language description of these conditions.

You can use the PCI Express Spec: 7.8.2. PCI Express Capabilities Register (Offset 02h)

Bits 7:4 Register Description:
Device/Port Type – Indicates the specific type of this PCI
Express Function. Note that different Functions in a
multi-Function device can generally be of different types.
Defined encodings are:
0000b PCI Express Endpoint
0001b Legacy PCI Express Endpoint
0100b Root Port of PCI Express Root Complex*
0101b Upstream Port of PCI Express Switch*
0110b Downstream Port of PCI Express Switch*
0111b PCI Express to PCI/PCI-X Bridge*
1000b PCI/PCI-X to PCI Express Bridge*
1001b Root Complex Integrated Endpoint




Hmm, actually I think I've already written code, for another patch, that
identifies the latter category. So everything where that check doesn't
fire can be deemed "PCI-PCI bridge". (This hook gets called only for
bridges.)

Yet another alternative: if we go for the special PCI capability, for
exposing reservation sizes from QEMU to the firmware, then I can simply
search the capability list for just that capability. I think that could
be the easiest for me.


That would be a "later" step.
BTW, following an offline chat with Michael S. Tsirkin regarding virtio 1.0
requiring 8M MMIO by default, we arrived at the conclusion that it is not
really needed, and we came up with an alternative that will require less
than 2M of MMIO space.
I put this here because the above solution will give us some time to deal with
the MMIO range reservation.

[...]

+
+
+4. Hot Plug
+============
+The root bus pcie.0 does not support hot-plug, so Integrated Devices,

s/root bus/root complex/? Also, any root complexes added with pxb-pcie
don't support hotplug.


Actually pxb-pcie should support PCI Express Native Hotplug.

Huh, interesting.

If they don't, it's a bug and I'll take care of it.

Hmm, a bit lower down you mention that PCI Express native hot plug is
based on SHPCs. So, when you say that pxb-pcie should support PCI
Express Native Hotplug, you mean that it should occur through SHPC, right?


Yes, but I was talking about the integrated SHPCs of the PCI Express
Root Ports and PCI Express Downstream Ports (for devices plugged into them).


However, for pxb-pci*, we had to disable SHPC: see QEMU commit
d10dda2d60c8 ("hw/pci-bridge: disable SHPC in PXB"), in June 2015.


This is only for the pxb device (not pxb-pcie), and only for the internal
pci-bridge that comes with it.
And... we don't use SHPC-based hot-plug for PCI, only for PCI Express.
For PCI we use only ACPI hotplug, so disabling SHPC is not so bad.

The pxb-pcie does not have the internal PCI bridge. You don't need it because:
1. You can't have Integrated Devices for pxb-pcie
2. The PCI Express Upstream Port is a type of PCI-Bridge anyway.


For background, the series around it was
<https://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05136.html>
-- I think v7 was the last version.

... Actually, now I wonder if d10dda2d60c8 should be possible to revert
at this point! Namely, in OVMF I may have unwittingly fixed this issue
-- obviously much later than the QEMU commit: in March 2016. See

https://github.com/tianocore/edk2/commit/8f35eb92c419

If you look at the commit message of the QEMU patch, it says

    [...]

    Unfortunately, when this happens, the PCI_COMMAND_MEMORY bit is
    clear in the root bus's command register [...]

which I think should no longer be true, thanks to edk2 commit 8f35eb92c419.

So maybe we should re-evaluate QEMU commit d10dda2d60c8. If pxb-pci and
pxb-pcie work with current OVMF, due to edk2 commit 8f35eb92c419, then
maybe we should revert QEMU commit d10dda2d60c8.

Not urgent for me :), obviously, I'm just explaining so you can make a
note for later, if you wish to (if hot-plugging directly into pxb-pcie
should be necessary -- I think it's very low priority).


As stated above, since we don't use it anyway it doesn't matter.

[...]

Nope, you cannot hotplug a PCI Express Root Port or a PCI Express
Downstream Port.
The reason: PCI Express Native Hotplug is based on SHPCs (Standard
HotPlug Controllers), which are integrated only in the mentioned ports,
and not in Upstream Ports or the Root Complex.
The "other" reason: when you buy a switch/server, it has a fixed number
of ports, and that's it. You cannot add more later.

Makes sense, thank you. I think if you add the HMP example, it will make
it clear. I only assumed that you needed several monitor commands for
hotplugging a single switch (i.e., one command per one port) because on
the QEMU command line you do need a separate -device option for the
upstream port, and every single downstream port, of the same switch.

If, using the monitor, it's just one device_add for the upstream port,
and the downstream ports are added automatically, then I guess it'll be
easy to understand.


No, it doesn't work like that; you would need to add them one by one
(the upstream port and then the downstream ports), as far as I understand it.
Actually I've never done it before; I'll try it first and update the doc on
how it should be done (if it can be done...).



But, this question is actually irrelevant IMO, because here I would add
another subsection about *planning* for hot-plug. (I think that's pretty
important.) And those plans should make the hotplugging of switches
unnecessary!


I'll add a subsection for it. But when you are out of options you *can*
hotplug a switch if your sysadmin skills are limited...

You probably can, but then we'll run into the resource allocation
problem again:

(1) The user will hotplug a switch (= S1) under a root port with, say,
two downstream ports (= S1-DP1, S1-DP2).

(2) They'll then plug a PCI Express device into one of those downstream
ports (S1-DP1-Dev1).

(3) Then they'll want to hot-plug *another* switch into the *other*
downstream port (S1-DP2-S2).


                         DP1 -- Dev1 (2)
                        /
     root port -- S1 (1)
                        \
                         DP2 -- S2 (3)

However, concerning the resource needs of S2 (and especially the devices
hot-plugged under S2!), S1 won't have enough left over, because Dev1
(under DP1) will have eaten into them, and Dev1's BARs will have been
programmed!


Theoretically the Guest OS should trigger PCI resource re-allocation,
but I agree we should not count on it.

We could never credibly explain our way out of this situation in a bug
report. For that reason, I think we should discourage hotplug ideas that
would change the topology, and require recursive resource allocation at
higher levels and/or parallel branches of the topology.

I know Linux can do that, and it even succeeds if there is enough room,
but from the messages seen in the guest dmesg when it fails, how do you
explain to the user that they should have plugged in S2 first, and Dev1
second?

So, we should recommend *not* to hotplug switches or PCI-PCI bridges.
Instead,
- keep a very flat hierarchy from the start;
- for PCI Express, add as many root ports and downstream ports as you
deem enough for future hotplug needs (keeping the flat formula I described);
- for legacy PCI, add as many sibling PCI-PCI bridges directly under the
one DMI-PCI bridge as you deem sufficient for future hotplug needs.

In short, don't change the hierarchy at runtime by hotplugging internal
nodes; hotplug *leaf nodes* only.
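The flat, pre-provisioned hierarchy described above could look like the following hypothetical command line (device IDs and chassis/slot numbers are arbitrary; ioh3420 is QEMU's PCI Express root port device):

```shell
# Sketch: pre-allocate spare PCI Express root ports at startup, so that
# runtime hotplug only ever adds leaf devices, never internal nodes.
qemu-system-x86_64 -M q35 \
  -device ioh3420,id=root_port1,chassis=1,slot=1 \
  -device ioh3420,id=root_port2,chassis=2,slot=2 \
  -device ioh3420,id=root_port3,chassis=3,slot=3 \
  -device virtio-net-pci,bus=root_port1
# root_port2 and root_port3 stay empty, reserved for future hotplug
```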


Agreed. I'll re-use some of your comments in the doc.



[...]


Gerd explicitly asked for the second idea (vendor specific capability)

Nice, thank you for confirming it; let's do this then. It will also
simplify my work in the
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() function: it should
suffice to scan the config space of the bridge, regardless of the
"PCI-PCI bridge / PCI Express root or downstream port" distinction.


Will do, but since we have a quick way to deal with the current issue
(virtio 1.0 requiring 8MB MMIO while the firmware reserves 2MB for
PCI bridge hotplug)

[...]

Thanks,
Marcel


Cheers!
Laszlo




