Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines


From: Laszlo Ersek
Subject: Re: [Qemu-devel] [PATCH RFC] docs: add PCIe devices placement guidelines
Date: Tue, 6 Sep 2016 15:31:55 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

On 09/05/16 22:02, Marcel Apfelbaum wrote:
> On 09/05/2016 07:24 PM, Laszlo Ersek wrote:
>> On 09/01/16 15:22, Marcel Apfelbaum wrote:
>>> Proposes best practices on how to use PCIe/PCI device
>>> in PCIe based machines and explain the reasoning behind them.
>>>
>>> Signed-off-by: Marcel Apfelbaum <address@hidden>
>>> ---
>>>
>>> Hi,
>>>
>>> Please add your comments on what to add/remove/edit to make this doc
>>> usable.
>>
> 
> Hi Laszlo,
> 
>> I'll give you a brain dump below -- most of it might easily be
>> incorrect, but I'll just speak my mind :)
>>
> 
> Thanks for taking the time to go over it, I'll do my best to respond
> to all the questions.
> 
>>>
>>> Thanks,
>>> Marcel
>>>
>>>  docs/pcie.txt | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 145 insertions(+)
>>>  create mode 100644 docs/pcie.txt
>>>
>>> diff --git a/docs/pcie.txt b/docs/pcie.txt
>>> new file mode 100644
>>> index 0000000..52a8830
>>> --- /dev/null
>>> +++ b/docs/pcie.txt
>>> @@ -0,0 +1,145 @@
>>> +PCI EXPRESS GUIDELINES
>>> +======================
>>> +
>>> +1. Introduction
>>> +================
>>> +The doc proposes best practices on how to use PCIe/PCI device
>>> +in PCIe based machines and explains the reasoning behind them.
>>
>> General request: please replace all occurrences of "PCIe" with "PCI
>> Express" in the text (not command lines, of course). The reason is that
>> the "e" letter is a minimal difference, and I've misread PCIe as PC
>> several times, while interpreting this document. Obviously the resultant
>> confusion is terrible, as you are explaining the difference between PCI
>> and PCI Express in the entire document :)
>>
> 
> Sure
> 
>>> +
>>> +
>>> +2. Device placement strategy
>>> +============================
>>> +QEMU does not have a clear socket-device matching mechanism
>>> +and allows any PCI/PCIe device to be plugged into any PCI/PCIe slot.
>>> +Plugging a PCI device into a PCIe device might not always work and
>>
>> s/PCIe device/PCI Express slot/
>>
> 
> Thanks!
> 
>>> +is weird anyway since it cannot be done for "bare metal".
>>> +Plugging a PCIe device into a PCI slot will hide the Extended
>>> +Configuration Space thus is also not recommended.
>>> +
>>> +The recommendation is to separate the PCIe and PCI hierarchies.
>>> +PCIe devices should be plugged only into PCIe Root Ports and
>>> +PCIe Downstream ports (let's call them PCIe ports).
>>
>> Please do not use the shorthand; we should always spell out downstream
>> ports and root ports. Assume people reading this document are dumber
>> than I am wrt. PCI / PCI Express -- I'm already pretty dumb, and I
>> appreciate the detail! :) If they are smart, they won't mind the detail;
>> if they lack expertise, they'll appreciate the detail, won't they. :)
>>
> 
> Sure
> 
>>> +
>>> +2.1 Root Bus (pcie.0)
>>
>> Can we call this Root Complex instead?
>>
> 
> Sorry, but we can't. The Root Complex is a type of Host-Bridge
> (and can actually "have" multiple Host-Bridges), not a bus.
> It stands between the CPU/Memory controller/APIC and the PCI/PCI Express
> fabric.
> (as you can see, I am not using PCIe even for the comments :))
> 
> The Root Complex *includes* an internal bus (pcie.0) but also
> can include some Integrated Devices, its own Configuration Space Registers
> (e.g. the Root Complex Register Block), ...
> 
> One of the main functions of the Root Complex is to
> generate PCI Express Transactions on behalf of the CPU(s) and
> to "translate" the corresponding PCI Express Transactions into DMA
> accesses.
> 
> I can change it to "PCI Express Root Bus", will that help?

Yes, it will, thank you.

All my other "root complex" mentions below were incorrect, in light of
your clarification, so please consider those accordingly.

> 
>>> +=====================
>>> +Plug only legacy PCI devices as Root Complex Integrated Devices
>>> +even if the PCIe spec does not forbid PCIe devices.
>>
>> I suggest "even though the PCI Express spec does not forbid PCI Express
>> devices as Integrated Devices". (Detail is good!)
>>
> Thanks
> 
>> Also, as Peter suggested, this (but not just this) would be a good place
>> to provide command line fragments.
>>
> 
> I've already added some examples; I'd appreciate it if you could have a
> look at v2, which I will post really soon.
> 
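For illustration, a hedged sketch of such a fragment, using the q35
machine and 2.7-era device names; the addr and chassis/slot values are
only placeholders:

  qemu-system-x86_64 -M q35 \
    -device e1000,bus=pcie.0,addr=0x5 \
    -device ioh3420,id=root_port1,bus=pcie.0,addr=0x6,chassis=1,slot=1 \
    -device i82801b11-bridge,id=dmi_pci_bridge,bus=pcie.0,addr=0x7

Here e1000 stands in for any legacy PCI device placed as an Integrated
Endpoint, ioh3420 is a PCI Express Root Port, and i82801b11-bridge is
the DMI-PCI bridge.
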
>>> The existing
>>> +hardware uses mostly PCI devices as Integrated Endpoints. In this
>>> +way we may avoid some strange Guest OS-es behaviour.
>>> +Other than that plug only PCIe Root Ports, PCIe Switches (upstream ports)
>>> +or DMI-PCI bridges to start legacy PCI hierarchies.
>>
>> Hmmmm, I had to re-read this paragraph (while looking at the diagram)
>> five times until I mostly understood it :) What about the following
>> wording:
>>
>> --------
>> Place only the following kinds of devices directly on the Root Complex:
>>
>> (1) For devices with dedicated, specific functionality (network card,
>> graphics card, IDE controller, etc), place only legacy PCI devices on
>> the Root Complex. These will be considered Integrated Endpoints.
>> Although the PCI Express spec does not forbid PCI Express devices as
>> Integrated Endpoints, existing hardware mostly integrates legacy PCI
>> devices with the Root Complex. Guest OSes are suspected to behave
>> strangely when PCI Express devices are integrated with the Root Complex.
>>
>> (2) PCI Express Root Ports, for starting exclusively PCI Express
>> hierarchies.
>>
>> (3) PCI Express Switches (connected with their Upstream Ports to the
>> Root Complex), also for starting exclusively PCI Express hierarchies.
>>
>> (4) For starting legacy PCI hierarchies: DMI-PCI bridges.
>>
> 
> Thanks for the re-wording!
> Actually I had a bug: even the Switches should be connected to Root
> Ports, not directly to the PCI Express Root Bus (pcie.0). I'll delete (3)
> to make it clear.

Ah, okay. That puts a lot of what I wrote in a different perspective :),
but I think the "as flat as possible" hierarchy should remain a valid
suggestion.

> 
> 
>>> +
>>> +
>>> +   pcie.0 bus
>>
>> "bus" is correct in QEMU lingo, but I'd still call it complex here.
>>
> 
> explained above
> 
>>> +   --------------------------------------------------------------------------
>>> +        |                |                    |                   |
>>> +   -----------   ------------------   ------------------   ------------------
>>> +   | PCI Dev |   | PCIe Root Port |   |  Upstream Port |   | DMI-PCI bridge |
>>> +   -----------   ------------------   ------------------   ------------------
>>> +
>>
>> Please insert a separate (brief) section here about pxb-pcie devices --
>> just mention that they are documented in a separate spec txt in more
>> detail, and that they create new root complexes in practice.
>>
>> In fact, maybe option (5) would be better for pxb-pcie devices, under
>> section 2.1, than a dedicated section!
>>
> 
> Good idea, I'll add the pxb-pcie device.
> 
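A hedged sketch of a pxb-pcie fragment; the bus_nr and chassis/slot
values are only placeholders:

  qemu-system-x86_64 -M q35 \
    -device pxb-pcie,id=pcie.1,bus_nr=0x30 \
    -device ioh3420,id=pxb_root_port1,bus=pcie.1,chassis=10,slot=10

That is, the extra root bus provided by pxb-pcie is itself populated
with Root Ports; see docs/pci_expander_bridge.txt for the details.
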
>>> +2.2 PCIe only hierarchy
>>> +=======================
>>> +Always use PCIe Root ports to start a PCIe hierarchy. Use PCIe switches (Upstream
>>> +Ports + several Downstream Ports) if out of PCIe Root Ports slots. PCIe switches
>>> +can be nested until a depth of 6-7. Plug only PCIe devices into PCIe Ports.
>>
>> - Please name the maximum number of the root ports that's allowed on the
>> root complex (cmdline example?)
>>
> 
> I'll try:
> The PCI Express Root Bus (pcie.0) is an internal bus that similar to a
> PCI bus
> supports up to 32 Integrated Devices/PCI Express Root Ports.

Thanks, sounds good. Also, apparently, I wasn't wrong about the number 32 :)
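
As a rough, untested sketch (placeholder addr/chassis/slot values), each
Root Port then occupies one of those 32 device slots on pcie.0:

  qemu-system-x86_64 -M q35 \
    -device ioh3420,id=rp1,bus=pcie.0,addr=0x4,chassis=1,slot=1 \
    -device ioh3420,id=rp2,bus=pcie.0,addr=0x5,chassis=2,slot=2 \
    -device ioh3420,id=rp3,bus=pcie.0,addr=0x6,chassis=3,slot=3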

>> - Also, this is the first time you mention "slot". While the PCI Express
>> spec allows for root ports / downstream ports not implementing a slot
>> (IIRC), I think we shouldn't muddy the waters here, and restrict the
>> word "slot" to the command line examples only.
>>
> 
> OK
> 
>> - What you say here about switches (upstream ports) matches what I've
>> learned from you thus far :), but it doesn't match bullet (3) in section
>> 2.1. That is, if we suggest to *always* add a Root Port between the Root
>> Complex and the Upstream Port of a switch, then (3) should not be
>> present in section 2.1. (Do we suggest that BTW?)
>>
>> We're not giving a technical description here (the PCI Express spec is
>> good enough for that), we're dictating policy. We shouldn't be shy about
>> minimizing the accepted use cases.
>>
>> Our main guidance here should be the amount of bus numbers used up by
>> the hierarchy. Parts of the document might later apply to
>> qemu-system-aarch64 -M virt, and that machine is severely starved in the
>> bus numbers department (it has MMCONFIG space for 16 buses only!)
>>
>> So how about this:
>>
>> * the basic idea is good I think: always go for root ports, unless the
>> root complex is fully populated
>>
>> * if you run out of root ports, use a switch with downstream ports, but
>> plug the upstream port directly in the root complex (make it an
>> integrated device). This would save us a bus number, and match option
>> (3) in section 2.1, but it doesn't match the diagram below, where a root
>> port is between the root complex and the upstream port. (Of course, if a
>> root port is *required* there, then 2.1 (3) is wrong, and should be
>> removed.)
>>
> 
> A Root Port is required, thanks for spotting the bug.
> 
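Something along these lines, then (an untested sketch, with placeholder
IDs and chassis/slot numbers), with the switch always sitting behind a
Root Port:

  qemu-system-x86_64 -M q35 \
    -device ioh3420,id=rp1,bus=pcie.0,chassis=1,slot=1 \
    -device x3130-upstream,id=upstream1,bus=rp1 \
    -device xio3130-downstream,id=downstream1,bus=upstream1,chassis=2,slot=2 \
    -device xio3130-downstream,id=downstream2,bus=upstream1,chassis=3,slot=3 \
    -device virtio-net-pci,bus=downstream1
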
>> * the "population algorithm" should be laid out in a bit more detail.
>> You mention a possible depth of 6-7, but I think it would be best to
>> keep the hierarchy as flat as possible (let's not waste bus numbers on
>> upstream ports, and time on deep enumeration!). In other words, only
>> plug upstream ports in the root complex (and without intervening root
>> ports, if that's allowed). For example:
>>
>> -  1-32 ports needed: use root ports only
>>
>> - 33-64 ports needed: use 31 root ports, and one switch with 2-32
>> downstream ports
>>
>> - 65-94 ports needed: use 30 root ports, one switch with 32 downstream
>> ports, another switch with 3-32 downstream ports
>>
>> - 95-125 ports needed: use 29 root ports, two switches with 32
>> downstream ports each, and a third switch with 2-32 downstream ports
>>
>> - 126-156 ports needed: use 28 root ports, three switches with 32
>> downstream ports each, and a fourth switch with 2-32 downstream ports
>>
>> - 157-187 ports needed: use 27 root ports, four switches with 32
>> downstream ports each, and a fifth switch with 2-32 downstream ports
>>
>> - 188-218 ports: 26 root ports, 5 fully populated switches, sixth switch
>> with 2-32 downstream ports,
>>
>> - 219-249 ports: 25 root ports, 6 fully pop. switches, seventh switch
>> with 2-32 downstream ports
>>
> 
> I can add it as a "best practice".

That would be highly appreciated, thanks! Of course, with the root ports
being mandatory between the PCI Express Root Bus and the upstream port
of every switch, we hit the bus number limit a bit earlier:

> 
>> (And I think this is where it ends, because the 7 upstream ports total
>> in the switches take up 7 bus numbers, so we'd need 249 + 7 = 256 bus
>> numbers, not counting the root complex, so 249 ports isn't even
>> attainable.)

we'd need 249+7+7.

>>
> 
> Theoretically we can implement multiple PCI domains, each domain can have
> 256 PCI buses, but we don't have that yet. An implementation should
> start with the pxb-pcie using separate PCI domains instead of "stealing"
> bus ranges from the Root Complex. But this is for another thread.

Definitely for another thread :)

> 
>> You might argue that this is way too detailed, but with the "problem
>> space" offering so much freedom (consider libvirt too...), I think it
>> would be helpful. This would also help trim the "explorations" of
>> downstream QE departments :)
>>
> 
> Well, we can accentuate that while nesting is supported, "deep nesting"
> is not recommended and is not even strictly necessary.

I agree, thanks.

> 
>>> +
>>> +
>>> +   pcie.0 bus
>>> +   ----------------------------------------------------
>>> +        |                |               |
>>> +   -------------   -------------   -------------
>>> +   | Root Port |   | Root Port |   | Root Port |
>>> +   ------------   --------------   -------------
>>> +         |                               |
>>> +    ------------                 -----------------
>>> +    | PCIe Dev |                 | Upstream Port |
>>> +    ------------                 -----------------
>>> +                                  |            |
>>> +                     -------------------    -------------------
>>> +                     | Downstream Port |    | Downstream Port |
>>> +                     -------------------    -------------------
>>> +                             |
>>> +                         ------------
>>> +                         | PCIe Dev |
>>> +                         ------------
>>> +
>>
>> So the upper right root port should be removed, probably.
>>
> 
> No, my bug in explanation, sorry.
> 
>> Also, I recommend to draw a "container" around the upstream port plus
>> the two downstream ports, and tack a "switch" label to it.
>>
> 
> Really? :) I had an interesting time with these "drawings". I'll try.

Haha, thanks :) If you are an emacs user, it should be easy. (I'm not an
emacs user, but my editor does support macros that make it okay-ish to
draw some basic ASCII art.)

> 
>>> +2.3 PCI only hierarchy
>>> +======================
>>> +Legacy PCI devices can be plugged into pcie.0 as Integrated Devices or
>>> +into DMI-PCI bridge. PCI-PCI bridges can be plugged into DMI-PCI bridges
>>> +and can be nested until a depth of 6-7. DMI-BRIDGES should be plugged
>>> +only into pcie.0 bus.
>>> +
>>> +   pcie.0 bus
>>> +   ----------------------------------------------
>>> +        |                            |
>>> +   -----------               ------------------
>>> +   | PCI Dev |               | DMI-PCI BRIDGE |
>>> +   ----------                ------------------
>>> +                               |            |
>>> +                        -----------    ------------------
>>> +                        | PCI Dev |    | PCI-PCI Bridge |
>>> +                        -----------    ------------------
>>> +                                         |           |
>>> +                                  -----------     -----------
>>> +                                  | PCI Dev |     | PCI Dev |
>>> +                                  -----------     -----------
>>
>> Works for me, but I would again elaborate a little bit on keeping the
>> hierarchy flat.
>>
>> First, in order to preserve compatibility with libvirt's current
>> behavior, let's not plug a PCI device directly in to the DMI-PCI bridge,
>> even if that's possible otherwise. Let's just say
>>
>> - there should be at most one DMI-PCI bridge (if a legacy PCI hierarchy
>> is required),
>>
>> - only PCI-PCI bridges should be plugged into the DMI-PCI bridge,
>>
>> - let's recommend that each PCI-PCI bridge be populated until it becomes
>> full, at which point another PCI-PCI bridge should be plugged into the
>> same one DMI-PCI bridge. Theoretically, with 32 legacy PCI devices per
>> PCI-PCI bridge, and 32 PCI-PCI bridges stuffed into the one DMI-PCI
>> bridge, we could have ~1024 legacy PCI devices (not counting the
>> integrated ones on the root complex(es)). There's also multi-function,
>> so I can't see anyone needing more than this.
>>
> 
> I can "live" with that. Even if it contradicts a little you flattening
> argument if you need more room for PCI devices but you don't need hotplug.
> In this case adding PCI devices to the DMI-PCI Bridge should be enough.
> But I agree we should keep it as simple as possible and your idea makes
> sense, thanks.
> 
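A hedged sketch of that legacy PCI arrangement (placeholder IDs and
chassis_nr values), with endpoints going on the PCI-PCI bridge rather
than directly on the DMI-PCI bridge:

  qemu-system-x86_64 -M q35 \
    -device i82801b11-bridge,id=dmi_pci_bridge,bus=pcie.0 \
    -device pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge,chassis_nr=1 \
    -device e1000,bus=pci_bridge1,addr=0x2
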
> 
>> For practical reasons though (see later), we should state here that we
>> recommend no more than 9 (nine) PCI-PCI bridges in total, all located
>> directly under the 1 (one) DMI-PCI bridge that is integrated into the
>> pcie.0 root complex. Nine PCI-PCI bridges should allow for 288 legacy
>> PCI devices. (And then there's multifunction.)
>>
> 
> OK... BTW the ~9 bridges limitation is the same for non-PCI Express
> machines, e.g. the i440FX machine.
> 
>>> +
>>> +
>>> +
>>> +3. IO space issues
>>> +===================
>>> +PCIe Ports are seen by Firmware/Guest OS as PCI bridges and
>>
>> (please spell out downstream + root port)
>>
> 
> OK
> 
>>> +as required by PCI spec will reserve a 4K IO range for each.
>>> +The firmware used by QEMU (SeaBIOS/OVMF) will further optimize
>>> +it by allocation the IO space only if there is at least a device
>>> +with IO BARs plugged into the bridge.
>>
>> This used to be true, but is no longer true, for OVMF. And I think it's
>> actually correct: we *should* keep the 4K IO reservation per PCI-PCI
>> bridge.
>>
> 
> I'll change to "should".
> 
>> (But, certainly no IO reservation for PCI Express root port, upstream
>> port, or downstream port! And i'll need your help for telling these
>> apart in OVMF.)
>>
> 
> Just let me know how I can help.

Well, in the EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
implementation, I'll have to look at the PCI config space of the
"bridge-like" PCI device that the generic PCI Bus driver of edk2 passes
back to me, asking me about resource reservation.

Based on the config space, I should be able to tell apart "PCI-PCI
bridge" from "PCI Express downstream or root port". So what I'd need
here is a semi-formal natural language description of these conditions.

Hmm, actually I think I've already written code, for another patch, that
identifies the latter category. So everything where that check doesn't
fire can be deemed "PCI-PCI bridge". (This hook gets called only for
bridges.)

Yet another alternative: if we go for the special PCI capability, for
exposing reservation sizes from QEMU to the firmware, then I can simply
search the capability list for just that capability. I think that could
be the easiest for me.
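
Semi-formally, I suppose it boils down to finding a PCI Express
Capability (ID 0x10) in the capability list and checking its Device/Port
Type field for "Root Port" or "Downstream Port"; lspci decodes the same
field, for example (the BDF and the output are only an illustration):

  lspci -s 00:1c.0 -v | grep Express
        Capabilities: [40] Express Root Port (Slot+), MSI 00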

> 
>> Let me elaborate more under section "4. Hot Plug". For now let me just
>> say that I'd like this language about optimization to be dropped.
>>
>>> +Behind a PCIe PORT only one device may be plugged, resulting in
>>
>> (again, please spell out downstream and root port)
>>
> 
> OK
> 
>>> +the allocation of a whole 4K range for each device.
>>> +The IO space is limited resulting in ~10 PCIe ports per system
>>
>> (limited to 65536 byte-wide IO ports, but it's fragmented, so we have
>> about 10 * 4K free)
>>
>>> +if devices with IO BARs are plugged into IO ports.
>>
>> not "into IO ports" but "into PCI Express downstream and root ports".
>>
> 
> oops, thanks
> 
>>> +
>>> +Using the proposed device placing strategy solves this issue
>>
>>> +by using only PCIe devices with PCIe PORTS. The PCIe spec requires
>>
>> (please spell out root / downstream etc)
>>
> 
> OK
> 
>>> +PCIe devices to work without IO BARs.
>>> +The PCI hierarchy has no such limitations.
>>
>> I'm sorry to have fragmented this section with so many comments, but the
>> idea is actually splendid, in my opinion!
>>
> 
> Thanks!
> 
>>
>> ... Okay, still speaking resources, could you insert a brief section
>> here about bus numbers? Under "3. IO space issues", you already explain
>> how "practically everything" qualifies as a PCI bridge. We should
>> mention that all those things, such as:
>> - root complex pcie.0,
>> - root complex added by pxb-pcie,
>> - root ports,
>> - upstream ports, downstream ports,
>> - bridges, etc
>>
>> take up bus numbers, and we have 256 bus numbers in total.
>>
> 
> I'll add a section for bus numbers, sure.
> 
>> In the next section you state that PCI hotplug (ACPI based) and PCI
>> Express hotplug (native) can work side by side, which is correct, and
>> the IO space competition is eliminated by the scheme proposed in section
>> 3 -- and the MMIO space competition is "obvious" --, but the bus number
>> starvation is *very much* non-obvious. It should be spelled out. I think
>> it deserves a separate section. (Again, with an eye toward
>> qemu-system-aarch64 -M virt -- we've seen PCI Express failures there,
>> and they were due to bus number starvation. It wasn't fun to debug.
>> (Well, it was, but don't tell anyone :)))
>>
> 
> Got it, I'll try to make PCI Bus numbering a limitation as important as IO.
> And we need to start looking at ways to solve this:
>   1. pxb-pcie starting different PCI domains - pxb-pcie becomes another
> Root Complex
>   2. switches can theoretically start PCI domains - emulate a Switch
> doing this.
> Long term plans, of course.

Right, let's not rush that; first I'd like to reach a status where PCI
and PCI Express hotplug "just works" with OVMF... And if it fails, I
should be able to point at the user's config, and this document, and say
"wrong configuration". That's the goal. :)

> 
>>> +
>>> +
>>> +4. Hot Plug
>>> +============
>>> +The root bus pcie.0 does not support hot-plug, so Integrated Devices,
>>
>> s/root bus/root complex/? Also, any root complexes added with pxb-pcie
>> don't support hotplug.
>>
> 
> Actually pxb-pcie should support PCI Express Native Hotplug.

Huh, interesting.

> If they don't, it's a bug and I'll take care of it.

Hmm, a bit lower down you mention that PCI Express native hot plug is
based on SHPCs. So, when you say that pxb-pcie should support PCI
Express Native Hotplug, you mean that it should occur through SHPC, right?

However, for pxb-pci*, we had to disable SHPC: see QEMU commit
d10dda2d60c8 ("hw/pci-bridge: disable SHPC in PXB"), in June 2015.

For background, the series around it was
<https://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05136.html>
-- I think v7 was the last version.

... Actually, now I wonder if d10dda2d60c8 should be possible to revert
at this point! Namely, in OVMF I may have unwittingly fixed this issue
-- obviously much later than the QEMU commit: in March 2016. See

https://github.com/tianocore/edk2/commit/8f35eb92c419

If you look at the commit message of the QEMU patch, it says

    [...]

    Unfortunately, when this happens, the PCI_COMMAND_MEMORY bit is
    clear in the root bus's command register [...]

which I think should no longer be true, thanks to edk2 commit 8f35eb92c419.

So maybe we should re-evaluate QEMU commit d10dda2d60c8. If pxb-pci and
pxb-pcie work with current OVMF, due to edk2 commit 8f35eb92c419, then
maybe we should revert QEMU commit d10dda2d60c8.

Not urgent for me :), obviously, I'm just explaining so you can make a
note for later, if you wish to (if hot-plugging directly into pxb-pcie
should be necessary -- I think it's very low priority).

> For pxb-pci (the PCI counterpart) it is another story: it needs ACPI code
> to be emitted, and the feature is not yet implemented.
> 
> 
>>> +DMI-PCI bridges and Root Ports can't be hot-plugged/hot-unplugged.
>>
>> I would say: ... so anything that plugs *only* into a root complex,
>> cannot be hotplugged. Then the list is what you mention here (also
>> referring back to options (1), (2) and (4) in section 2.1), *plus* I
>> would also add option (5): pxb-pcie can also not be hotplugged.
>>
> 
> because it is actually an Integrated Device.
> 
>>> +
>>> +PCI devices can be hot-plugged into PCI-PCI bridges. (There is a bug
>>> +in QEMU preventing it to work, but it would be solved soon).
>>
>> What bug?
>>
> 
> As stated above, PCI hotplug is based on emitting ACPI code for
> recognizing the right slot (see the "bsel" ACPI variables). Basically each
> PCI-Bridge slot has a different "bsel" value, used during the
> hotplug mechanism to identify the slot where the device is
> hot-plugged/hot-unplugged.
> 
> For the PC machine the ACPI is generated, while for Q35 it is not.
> (I think I've already sent an RFC some time ago for that)
> 
>> Anyway, I'm unsure we should add this remark here -- it's a guide, not a
>> status report. I'm worried that whenever we fix that bug, we forget to
>> remove this remark.
>>
> 
> will remove it
> 
>>> +The PCI hotplug is ACPI based and can work side by side with the PCIe
>>> +native hotplug.
>>> +
>>> +PCIe devices can be natively hot-plugged/hot-unplugged into/from
>>> +PCIe Ports (Root Ports/Downstream Ports). Switches are hot-pluggable.
>>
>> I would mention the order (upstream port, downstream port), also add
>> some command lines maybe.
>>
> 
> I'll add some HMP examples. Should I try them first :) ?

Seeing it function as expected wouldn't hurt! :)
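
Something like the following, I guess (assuming a Root Port with id=rp3
was left empty at boot; the IDs are placeholders):

  (qemu) device_add virtio-net-pci,bus=rp3,id=hotplugged_nic
  (qemu) device_del hotplugged_nic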

> 
>>> +Keep in mind you always need to have at least one PCIe Port available
>>> +for hotplug, the PCIe Ports themselves are not hot-pluggable.
>>
>> Well, the downstream ports of a switch that is being added *are*, aren't
>> they?
> 
> Nope, you cannot hotplug a PCI Express Root Port or a PCI Express
> Downstream Port.
> The reason: The PCI Express Native Hotplug is based on SHPCs (Standard
> HotPlug Controllers), which are integrated only in the mentioned ports
> and not in Upstream Ports or the Root Complex.
> The "other" reason: When you buy a switch/server it has a number of
> ports and that's it. You cannot add more "later".

Makes sense, thank you. I think if you add the HMP example, it will make
it clear. I only assumed that you needed several monitor commands for
hotplugging a single switch (i.e., one command per one port) because on
the QEMU command line you do need a separate -device option for the
upstream port, and every single downstream port, of the same switch.

If, using the monitor, it's just one device_add for the upstream port,
and the downstream ports are added automatically, then I guess it'll be
easy to understand.

> 
>>
>> But, this question is actually irrelevant IMO, because here I would add
>> another subsection about *planning* for hot-plug. (I think that's pretty
>> important.) And those plans should make the hotplugging of switches
>> unnecessary!
>>
> 
> I'll add a subsection for it. But when you are out of options you *can*
> hotplug a switch if your sysadmin skills are limited...

You probably can, but then we'll run into the resource allocation
problem again:

(1) The user will hotplug a switch (= S1) under a root port with, say,
two downstream ports (= S1-DP1, S1-DP2).

(2) They'll then plug a PCI Express device into one of those downstream
ports (S1-DP1-Dev1).

(3) Then they'll want to hot-plug *another* switch into the *other*
downstream port (S1-DP2-S2).


                         DP1 -- Dev1 (2)
                        /
     root port -- S1 (1)
                        \
                         DP2 -- S2 (3)

However, concerning the resource needs of S2 (and especially the devices
hot-plugged under S2!), S1 won't have enough left over, because Dev1
(under DP1) will have eaten into them, and Dev1's BARs will have been
programmed!

We could never credibly explain our way out of this situation in a bug
report. For that reason, I think we should discourage hotplug ideas that
would change the topology, and require recursive resource allocation at
higher levels and/or parallel branches of the topology.

I know Linux can do that, and it even succeeds if there is enough room,
but from the messages seen in the guest dmesg when it fails, how do you
explain to the user that they should have plugged in S2 first, and Dev1
second?

So, we should recommend *not* to hotplug switches or PCI-PCI bridges.
Instead,
- keep a very flat hierarchy from the start;
- for PCI Express, add as many root ports and downstream ports as you
deem enough for future hotplug needs (keeping the flat formula I described);
- for legacy PCI, add as many sibling PCI-PCI bridges directly under the
one DMI-PCI bridge as you deem sufficient for future hotplug needs.

In short, don't change the hierarchy at runtime by hotplugging internal
nodes; hotplug *leaf nodes* only.
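
In command line terms, that planning simply means leaving spare ports
and bridges around at boot; a hedged sketch, with placeholder IDs and
numbers:

  qemu-system-x86_64 -M q35 \
    -device ioh3420,id=spare_rp,bus=pcie.0,chassis=9,slot=9 \
    -device i82801b11-bridge,id=dmi_pci_bridge,bus=pcie.0 \
    -device pci-bridge,id=spare_pci_bridge,bus=dmi_pci_bridge,chassis_nr=9

and then hotplugging endpoints only, into spare_rp (PCI Express native
hotplug) or spare_pci_bridge (ACPI based hotplug).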

> 
>> * For the PCI Express hierarchy, I recommended a flat structure above.
>> The 256 bus numbers can easily be exhausted / covered by 25 root ports
>> plus 7 switches (each switch being fully populated with downstream
>> ports). This should allow all sysadmins to estimate their expected
>> numbers of hotplug PCI Express devices, in advance, and create enough
>> root ports / downstream ports. (Sysadmins are already used to planning
>> for hotplug, see VCPUs, memory (DIMM), memory (balloon).)
>>
> 
> That's another good idea, I'll add it to the doc, thanks!
> 
>> * For the PCI hierarchy, it should be even simpler, but worth a mention
>> -- start with enough PCI-PCI bridges under the one DMI-PCI bridge.
>>
> 
> OK
> 
>> * Finally, this is the spot where we should design and explain our
>> resource reservation for hotplug:
>>
>>   - For PCI Express hotplug, please explain that only such PCI Express
>>     devices can be hotplugged that require no IO space -- in section
>>     "3. IO space issues" you mention that this is a valid restriction.
>>     Furthermore, please state the MMIO32 and/or MMIO64 sizes that the
>>     firmware needs to reserve for root ports / downstream ports, and
>>     also explain that these sizes will act as maximum size and
>>     alignment limits for *individual* hotplug devices.
>>
> 
> OK
> 
>>     We can invent fw_cfg switches for this, or maybe even a special PCI
>>     Express capability (to be placed in the config space of root and
>>     downstream ports).
>>
> 
> Gerd explicitly asked for the second idea (vendor specific capability)

Nice, thank you for confirming it; let's do this then. It will also
simplify my work in the
EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() function: it should
suffice to scan the config space of the bridge, regardless of the
"PCI-PCI bridge / PCI Express root or downstream port" distinction.

> 
>>   - For legacy PCI hotplug, this is where my evil plan of "no more than
>>     9 (nine) PCI-PCI bridges under the 1 (one) DMI-PCI bridge" unfolds!
>>
> 
> The same as for PC; should this doc deal with this if it is a common issue?
> Maybe a simple comment is enough.
> 
>>     We've stated above (in Section 3) that we have about 10*4KB IO port
>>     space. One of those 4K chunks will go (collectively) to the
>>     Integrated PCI devices that sit on pcie.0. (If there are other root
>>     complexes from pxb-pcie devices, then those will get one chunk each
>>     too.) The rest -- hence assume 9 or fewer chunks -- will be consumed
>>     by the 9 (or respectively fewer) PCI-PCI bridges, for hotplug
>>     reservation. The upshot is that as long as a sysadmin sticks with
>>     our flat, "9 PCI-PCI bridges total" recommendation for the legacy
>>     PCI hierarchy, the IO reservation will be covered *immediately*.
>>     Simply put: don't create more PCI-PCI bridges than you have IO
>>     space for -- and that should leave you with about 9 "sibling"
>>     bridges, which are plenty enough for a huge number of legacy PCI
>>     devices!
>>
> 
> I'll use that, thanks!
> 
>>     Furthermore, please state the MMIO32 / MMIO64 amount to reserve
>>     *per PCI-PCI bridge*. The firmware programmers need to know this,
>>     and people planning for legacy PCI hotplug should be informed that
>>     those limits are for all devices *together* on the same PCI-PCI
>>     bridge.
>>
> 
> Yes... we'll ask here for the minimum 8MB MMIO because of the virtio 1.0
> behavior and use that when "pushing" the patches for OVMF/SeaBIOS.
> It's fun: first we "make" the rules, then we say "Hey, it's written in the
> QEMU docs".

Absolutely. This is how it should work! ;)

The best part of Gerd's suggestion, as far as the firmwares are
concerned, is that we won't have to hard-code any constants in the
firmware. We'll just have to parse the PCI config spaces of the bridges,
for the vendor specific capability, and use the numbers from the
capability for resource reservation. The OVMF and SeaBIOS patches won't
have any constants in them. :)

Should we change our minds in QEMU later, no firmware patches should be
necessary.

> 
>>     Again, we could expose this via fw_cfg, or in a special capability
>>     (as suggested by Gerd IIRC) in the PCI config space of the PCI-PCI
>>     bridge.
>>
> 
> Agreed
> 
>>> +
>>> +
>>> +5. Device assignment
>>> +====================
>>> +Host devices are mostly PCIe and should be plugged only into PCIe ports.
>>> +PCI-PCI bridge slots can be used for legacy PCI host devices.
>>
>> Please provide a command line (lspci) so that users can easily determine
>> if the device they wish to assign is legacy PCI or PCI Express.
>>
> 
> OK, something like:
> 
> lspci -s 03:00.0 -v (as root)
> 03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
>     Subsystem: Intel Corporation Dual Band Wireless-AC 7260
>     Flags: bus master, fast devsel, latency 0, IRQ 50
>     Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
>     Capabilities: [c8] Power Management version 3
>     Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>     Capabilities: [40] Express Endpoint, MSI 00
> 
>                             ^^^^^^^^^^^^^^^
> 
>     Capabilities: [100] Advanced Error Reporting
>     Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
>     Capabilities: [14c] Latency Tolerance Reporting
>     Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1
> Len=014 <?>

Yep, looks great.
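
And the assignment itself would then look something like this (the port
and bridge IDs are placeholders, the host BDF is the one from the lspci
output above):

  -device vfio-pci,host=03:00.0,bus=rp1

for a PCI Express host device behind a Root Port, or bus=pci_bridge1 for
a legacy PCI host device.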

> 
> 
> 
>>> +
>>> +
>>> +6. Virtio devices
>>> +=================
>>> +Virtio devices plugged into the PCI hierarchy or as an Integrated Devices
>>
>> (drop "an")
>>
> 
> OK
> 
>>> +will remain PCI and have transitional behaviour as default.
>>
>> (Please add one sentence about what "transitional" means in this context
>> -- they'll have both IO and MMIO BARs.)
>>
> 
> OK
> 
>>> +Virtio devices plugged into PCIe ports are Express devices and have
>>> +"1.0" behavior by default without IO support.
>>> +In both case disable-* properties can be used to override the behaviour.
>>
>> Please emphasize that setting disable-legacy=off (that is, enabling
>> legacy behavior) for PCI Express virtio devices will cause them to
>> require IO space, which, given our PCI Express hierarchy, may quickly
>> lead to resource exhaustion, and is therefore strongly discouraged.
>>
> 
> Sure
> 
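A hedged sketch of the two defaults and the discouraged override (the
bus IDs are placeholders):

  -device virtio-net-pci,bus=pci_bridge1               # transitional: IO and MMIO BARs
  -device virtio-net-pci,bus=rp1                       # Express, "1.0", no IO BAR
  -device virtio-net-pci,bus=rp1,disable-legacy=off    # legacy re-enabled: needs IO space, discouraged
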
>>> +
>>> +
>>> +7. Conclusion
>>> +==============
>>> +The proposal offers a usage model that is easy to understand and follow
>>> +and in the same time overcomes some PCIe limitations.
>>
>> I agree!
>>
>> Thanks!
>> Laszlo
>>
> 
> 
> Thanks for the detailed review! I was planning on sending V2 today,
> but to follow your comments I will need another day.
> Don't get me wrong, it's totally worth it :)

Thank you. I'm very happy about this document coming along. And, I too
will likely need a separate day to review your v2. :)

Cheers!
Laszlo


