From: Laszlo Ersek
Subject: Re: [Qemu-devel] [PATCH] hw/pci: do not update the PCI mappings while Decode (I/O or memory) bit is not set in the Command register
Date: Thu, 14 Jan 2016 15:30:41 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1

On 01/14/16 13:24, Marcel Apfelbaum wrote:
> On 01/11/2016 08:57 PM, Marcel Apfelbaum wrote:
>> On 01/11/2016 08:44 PM, Laszlo Ersek wrote:
>>> On 01/11/16 19:01, Marcel Apfelbaum wrote:
>>>> On 01/11/2016 07:15 PM, Laszlo Ersek wrote:
>>>>> On 01/11/16 17:34, Marcel Apfelbaum wrote:
>>>>>> On 01/11/2016 06:11 PM, Laszlo Ersek wrote:
>>>>>>> On 01/11/16 13:24, Marcel Apfelbaum wrote:
>>>>>>>> Two reasons:
>>>>>>>>     - PCI Spec indicates that while the bit is not set
>>>>>>>>       the memory sizing is not finished.
>>>>>>>>     - pci_bar_address will return PCI_BAR_UNMAPPED
>>>>>>>>       and a previous value can be accidentally overridden
>>>>>>>>       if the command register is modified (and not the BAR).
>>>>>>>>
>>>>>>>> Signed-off-by: Marcel Apfelbaum <address@hidden>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I found this when trying to use multiple root complexes with OVMF.
>>>>>>>>
>>>>>>>> When trying to attach a device to the pxb-pcie device as an
>>>>>>>> Integrated Device it did not receive the IO/MEM resources.
>>>>>>>>
>>>>>>>> The reason is that OVMF works like this:
>>>>>>>>     1. It disables the Decode (I/O or memory) bit in the Command
>>>>>>>>        register.
>>>>>>>>     2. It configures the device BARs.
>>>>>>>>     3. It makes some tests on the Command register.
>>>>>>>>     4. ...
>>>>>>>>     5. It enables the Decode (I/O or memory) bit at some point.
>>>>>>>>
>>>>>>>> At step 3 all the BARs are overridden to 0xffffffff by QEMU.
>>>>>>>>
>>>>>>>> Since QEMU uses the device BARs to compute the new host bridge
>>>>>>>> resources, it now gets garbage.
>>>>>>>>
>>>>>>>> Laszlo, this also solves the SHPC problem for the pci-2-pci bridge
>>>>>>>> inside the pxb.
>>>>>>>> Now we can enable the SHPC for it too.
>>>>>>>
>>>>>>> I encountered the exact same problem months ago. I posted patches
>>>>>>> for it; you were CC'd. :)
>>>>>>>
>>>>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/342206/focus=342209
>>>>>>>
>>>>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/342206/focus=342210
>>>>>>>
>>>>>>>
>>>>>>> As you can see under the second link above, I made the same
>>>>>>> analysis & observations as you do now. (It took me quite long to
>>>>>>> track down the "inexplicable" behavior of edk2's generic PCI bus
>>>>>>> driver / enumerator that is built into OVMF.)
>>>>>>
>>>>>> Wow, I just re-worked this issue again from 0! I wish I had
>>>>>> remembered those threads :(
>>>>>> This was another symptom of the exact same problem! And I remembered
>>>>>> something about SHPC; I should have looked at those mail threads
>>>>>> again...
>>>>>>
>>>>>>>
>>>>>>> I proposed to change pci_bar_address() so that it could return, to
>>>>>>> distinguished callers, the BAR values "under programming", even if
>>>>>>> the command bits were clear. Then the ACPI generator would utilize
>>>>>>> this special exception.
>>>>>>>
>>>>>>> Michael disagreed; in
>>>>>>>
>>>>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/342206/focus=342242
>>>>>>>
>>>>>>>
>>>>>>> he wrote "[t]his is problematic - disabled BAR values have no
>>>>>>> meaning according to the PCI spec".
>>>>>>>
>>>>>>
>>>>>> Yes... because it looked like a hook for our case only.
>>>>>> The good news is that this patch is based exactly on the fact that
>>>>>> the BARs have no meaning if the bit is not set.
>>>>>>
>>>>>>> The current  solution to the problem (= we disable the SHPC) was
>>>>>>> recommended by Michael in that message: "It might be best to add a
>>>>>>> property to just disable shpc in the bridge so no devices reside
>>>>>>> directly behind the pxb?"
>>>>>>>
>>>>>>
>>>>>> I confess I don't exactly understand what the SHPC of the
>>>>>> pci-2-pci bridge has to do with sibling devices on the pxb's root
>>>>>> bus (SHPC is the hot-plug controller for the devices behind the
>>>>>> pci-2-pci bridge).
>>>>>>
>>>>>> The second part I do understand: the pxb design was to not have
>>>>>> devices directly behind the pxb, so maybe he meant that SHPC is the
>>>>>> part of the pci-bridge that behaves like a device in the sense that
>>>>>> it requires IO/MEM resources.
>>>>>>
>>>>>> Bottom line, your solution for the PXB was just fine :)
>>>>>>
>>>>>>
>>>>>>> In comparison, your patch doesn't change pci_bar_address().
>>>>>>> Instead, it modifies pci_update_mappings() *not to call*
>>>>>>> pci_bar_address(), if the respective command bits are clear.
>>>>>>>
>>>>>>> I guess that could have about the same effect.
>>>>>>>
>>>>>>> If, unlike my patch, yours actually improves QEMU's compliance
>>>>>>> with the PCI specs, then it's likely a good patch. (And apparently
>>>>>>> more general than the SHPC-specific solution we have now.)
>>>>>>
>>>>>>
>>>>>> Exactly! Why should a PCI write to the command register *delete*
>>>>>> previously set resources? I am looking at it as a bug.
>>>>>>
>>>>>> And also, updating the mappings while the Decoding bit is not enabled
>>>>>> is at least not necessary.
>>>>>>
>>>>>>>
>>>>>>> I just don't know if it's a good idea to leave any old mappings
>>>>>>> active while the BARs are being reprogrammed (with the command bits
>>>>>>> clear).
>>>>>>>
>>>>>>
>>>>>> First, because the OS can't use the IO/MEM anyway; secondly, the
>>>>>> guest OS/firmware is the one that disabled the bit... (in order to
>>>>>> program resources)
>>>>>
>>>>> I have something like the following in mind. Do you think it is a
>>>>> valid (although contrived) use case?
>>>>>
>>>>> - guest programs some BAR and uses it (as designed / intended)
>>>>> - guest disables command bit, modifies BAR location
>>>>> - guest accesses *old* BAR location
>>>>>
>>>>> What should a guest *expect* in such a case? Is this invalid guest
>>>>> behavior?
>>>>
>>>> Yes, this is indeed invalid behavior, from the device point of view
>>>> it is disabled. Best case scenario - the guest will see 0xffffffff,
>>>> worst case - garbage.
>>>>
>>>>>
>>>>> If it is not invalid, then will QEMU comply with the guest's
>>>>> expectations if your patch is applied? Pre-patch, the guest would
>>>>> likely access a "hole" in the host bridge MMIO aperture, whereas
>>>>> with your patch (I guess?) it still might access the device through
>>>>> the old (still active) BAR?
>>>>>
>>>>
>>>> Since the IO is disabled, pci_bar_address will return PCI_BAR_UNMAPPED
>>>> and *no updates will be made* pre or post this patch. It will behave
>>>> the same from the guest point of view. It will still access the
>>>> memory region of the device.
>>>>
>>>>
>>>>> Or would QEMU prevent that just by virtue of the command bit being
>>>>> clear?
>>>>
>>>> Again, this patch only changes the behavior in a specific case:
>>>> when the device is disabled and the guest writes to the command
>>>> register without enabling IO/MEM.
>>>
>>> Ah, right! That's exactly what the edk2 PCI bus driver / enumerator
>>> does. Massages the command register without touching those two bits,
>>> and...
>>>
>>>>
>>>> Pre-patch -> the old BAR addresses are overridden with 0xffffffff
>>>>               (I really think this is a bug, nobody asked for this!!)
>>>
>>> the previously programmed BAR address gets lost. It's great that you
>>> have the PCI knowledge to state that this is actually a bug! It had
>>> looked fishy to me as well, but I couldn't make the same argument.
>>>
>>>> Post-Patch -> the old BAR values are returned to the prev state ->
>>>> correct behavior IMHO.
>>>
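
(For reference, the change being discussed above boils down to a guard of
roughly the following shape in hw/pci/pci.c:pci_update_mappings(). The
macros and types are QEMU's own; the function below is an illustrative
sketch, not Marcel's actual patch.)

  #include "qemu/osdep.h"
  #include "hw/pci/pci.h"

  /*
   * Illustrative only: skip BARs whose decode bit is clear in the Command
   * register, so that a write to the Command register alone can no longer
   * tear down (or overwrite) a previously programmed mapping.
   */
  static void pci_update_mappings_sketch(PCIDevice *d)
  {
      uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
      int i;

      for (i = 0; i < PCI_NUM_REGIONS; i++) {
          PCIIORegion *r = &d->io_regions[i];
          pcibus_t new_addr;

          if (!r->size) {
              continue;
          }

          /*
           * While the relevant decode bit is clear, the BAR contents have
           * no meaning yet; leave any existing mapping alone instead of
           * replacing it with PCI_BAR_UNMAPPED.
           */
          if (!(cmd & (r->type & PCI_BASE_ADDRESS_SPACE_IO ?
                       PCI_COMMAND_IO : PCI_COMMAND_MEMORY))) {
              continue;
          }

          new_addr = pci_bar_address(d, i, r->type, r->size);
          /* ... compare new_addr with r->addr and remap as usual ... */
          (void)new_addr;
      }
  }
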
> 
> Hi Laszlo,
> 
> After an off-line talk with Michael it seems that this solution is also
> not valid.
> 
> When we build the ACPI table we don't take the IO/mem ranges directly
> from the Configuration Space, but use the proper wrappers to ensure the
> values are valid.
>
> When IO decoding is disabled the wrappers do not modify the actual
> Configuration Space, but only the mappings, to show that the values are
> not valid.
>
> It doesn't matter what we are going to do here, the real problem lies in
> OVMF not *enabling* IO/Mem decoding before requesting the ACPI tables.
> (If OVMF did not do the "massage" we would get the right values by
> chance; we should not look at the values when IO decoding is disabled.)

When discussing this, please address the code that is specific to OVMF
*separately* from code that is generic in edk2. OVMF (the binary) comes
together from a large number of modules. The code that is under my
co-maintainership, and is specific to QEMU, lives under OvmfPkg/ in the
edk2 tree. Most other modules in the tree, especially under
MdeModulePkg/, are generic, and run unmodified on physical hardware.

Why is this important? Because it is easy (or easier, anyway) to adapt
things in OvmfPkg/ to QEMU. Modifying code in MdeModulePkg/ so that it
works better with QEMU is *very* hard, especially if the changes are
intrusive. The PCI bus driver (that is responsible for the enumeration
etc) is generic, and lives in "MdeModulePkg/Bus/Pci/PciBusDxe/". If
you'd like to change that, that will require extremely strong arguments
for the maintainer to accept.

MdeModulePkg/Bus/Pci/PciBusDxe communicates with the platform-specific
PCI host bridge / root bridge driver. OVMF does have a specific version
of that driver, in "OvmfPkg/PciHostBridgeDxe".

Depending on whether you find an issue in the (higher level, generic)
"MdeModulePkg/Bus/Pci/PciBusDxe", versus the lower level, specific
"OvmfPkg/PciHostBridgeDxe", the effort required to fix the problem may
differ hugely.

The massaging of the command register occurs in
"MdeModulePkg/Bus/Pci/PciBusDxe". If we say that the command register
massaging is invalid, then the first question that the maintainer will
likely ask is, "why does it work on physical hardware then".

Unless we can argue that the massaging works on physical hardware *in
spite of* the PCI specification, just as a lucky coincidence (that is,
we can prove that the PciBusDxe code violates the PCI spec), our claim
might well turn out a dead end.

(Unless the change needed is really non-intrusive, or can be controlled
with build-time flags.)

> 
> Bottom line, we have 2 options:
> 1. Modify the OVMF to enable IO Decoding *before* loading the ACPI tables.

OVMF downloads the ACPI tables over fw_cfg strictly after PCI
enumeration and resource allocation complete. This is the timeline:

(1) The "OvmfPkg/PciHostBridgeDxe" driver is dispatched by the DXE core
at some point. This is a DXE_DRIVER, which means that it produces its
protocols (i.e., it makes structures with function pointers in them
available to the rest of the firmware) right in its entry point
function. You can see the GUIDs of these protocols listed in
"PciHostBridgeDxe.inf", under [Protocols], with the comment ## PRODUCES.
Two protocols are relevant here:

- gEfiPciHostBridgeResourceAllocationProtocolGuid
- gEfiPciRootBridgeIoProtocolGuid

This driver exposes the root bridges (root buses) that are available in
the system -- one EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL structure for each
root bus --, and a "singleton"
EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL structure that allows
the higher level PCI bus driver to allocate resources.
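
(In skeleton form, "produces its protocols right in its entry point" means
something like the following; the Sketch* names and the empty protocol
instance are invented for the example, while the boot service call is the
standard one.)

  #include <Uefi.h>
  #include <Library/UefiBootServicesTableLib.h>
  #include <Protocol/PciRootBridgeIo.h>

  //
  // Heavily reduced: the real driver fills in the member functions of one
  // EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL instance per root bus, and also
  // installs the host bridge resource allocation protocol.
  //
  STATIC EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL  mSketchRootBridgeIo;

  EFI_STATUS
  EFIAPI
  SketchHostBridgeEntryPoint (
    IN EFI_HANDLE        ImageHandle,
    IN EFI_SYSTEM_TABLE  *SystemTable
    )
  {
    EFI_HANDLE  Handle;

    Handle = NULL;
    //
    // A DXE_DRIVER makes its protocols available right here, in the entry
    // point, without waiting to be connected to anything.
    //
    return gBS->InstallMultipleProtocolInterfaces (
                  &Handle,
                  &gEfiPciRootBridgeIoProtocolGuid, &mSketchRootBridgeIo,
                  NULL
                  );
  }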

(2) The "MdeModulePkg/Bus/Pci/PciBusDxe" driver is launched by the DXE
core at some point. If you look at "PciBusDxe.inf", you can see in the
[Protocols] section that the driver consumes the following protocols
(among other things):

- gEfiPciHostBridgeResourceAllocationProtocolGuid
- gEfiPciRootBridgeIoProtocolGuid

That is, the two produced in (1).

Now, *unlike* the driver in (1), this is not a DXE_DRIVER module, but a
UEFI_DRIVER module. Meaning, it conforms to the UEFI driver model: it
doesn't do practically anything in its entry point function; instead it
expects callbacks via the driver binding protocol. It won't bind root
bridge devices until it is explicitly asked to, whereas the root bridge
/ host bridge driver in (1) produces the root bridge protocol instances
right off the bat.

You can also witness this in the [Protocols] usage comments. For a
DXE_DRIVER, you have PRODUCES and CONSUMES -- these are "driver level"
annotations. For a UEFI_DRIVER, in connection with the driver binding
protocol, you will see TO_START (= the protocol this driver is capable
of binding) and BY_START (= the protocol the driver produces on top of
the handles it binds).

For the PCI bus driver, both protocols listed above are marked TO_START.
We also have "gEfiPciIoProtocolGuid", marked BY_START.

This means that the PCI bus driver will bind, *when asked*, the root
bridge protocol instances (that were automatically produced in (1)). It
enumerates the devices on those root bridges (and recursively any other
non-root bridges), allocates resources for them (with the other protocol
listed in (1)), and produces, *when asked*, EFI_PCI_IO_PROTOCOL
instances, one for each PCI device found.
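
(A minimal sketch of that driver model shape follows. All Sketch* names
and the stub bodies are invented for the example; the point is that the
entry point merely registers the driver binding instance, and the real
work waits for Start().)

  #include <Uefi.h>
  #include <Library/UefiLib.h>
  #include <Protocol/DriverBinding.h>

  //
  // Stub member functions, placeholders for the sketch. In the real PCI
  // bus driver, Supported() checks for a root bridge IO protocol on the
  // handle, and Start() enumerates the bus, allocates resources, and
  // creates the child PciIo handles.
  //
  STATIC
  EFI_STATUS
  EFIAPI
  SketchSupported (
    IN EFI_DRIVER_BINDING_PROTOCOL  *This,
    IN EFI_HANDLE                   Controller,
    IN EFI_DEVICE_PATH_PROTOCOL     *RemainingDevicePath OPTIONAL
    )
  {
    return EFI_UNSUPPORTED;
  }

  STATIC
  EFI_STATUS
  EFIAPI
  SketchStart (
    IN EFI_DRIVER_BINDING_PROTOCOL  *This,
    IN EFI_HANDLE                   Controller,
    IN EFI_DEVICE_PATH_PROTOCOL     *RemainingDevicePath OPTIONAL
    )
  {
    return EFI_SUCCESS;
  }

  STATIC
  EFI_STATUS
  EFIAPI
  SketchStop (
    IN EFI_DRIVER_BINDING_PROTOCOL  *This,
    IN EFI_HANDLE                   Controller,
    IN UINTN                        NumberOfChildren,
    IN EFI_HANDLE                   *ChildHandleBuffer OPTIONAL
    )
  {
    return EFI_SUCCESS;
  }

  STATIC EFI_DRIVER_BINDING_PROTOCOL  mSketchDriverBinding = {
    SketchSupported,
    SketchStart,
    SketchStop,
    0x10,   // Version
    NULL,   // ImageHandle -- filled in by EfiLibInstallDriverBinding()
    NULL    // DriverBindingHandle -- likewise
  };

  //
  // A UEFI_DRIVER entry point does essentially nothing but register the
  // driver binding instance; the real work waits until
  // gBS->ConnectController() asks for Start() on a specific handle.
  //
  EFI_STATUS
  EFIAPI
  SketchBusDriverEntryPoint (
    IN EFI_HANDLE        ImageHandle,
    IN EFI_SYSTEM_TABLE  *SystemTable
    )
  {
    return EfiLibInstallDriverBinding (ImageHandle, SystemTable,
             &mSketchDriverBinding, ImageHandle);
  }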

(

In turn, the various PCI device drivers (video driver, virtio-block
driver, etc) consume EFI_PCI_IO_PROTOCOL, and produce other UEFI
abstractions on top (Graphics Output Protocol, Block IO protocol, etc).

)

Finally, you can also find this under [Protocols], for PciBusDxe:

  gEfiPciEnumerationCompleteProtocolGuid          ## PRODUCES

Note that this is *not* a BY_START remark, but a PRODUCES remark. The
PCI bus driver will install a firmware-wide empty protocol structure,
with the above protocol GUID, after the enumeration / resource
allocation is complete. This allows other drivers to *wait* for that
enumeration to complete.
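
(Installing such a firmware-wide "empty" protocol is a single boot service
call, roughly as below; the function name is invented for the sketch.)

  #include <Uefi.h>
  #include <Library/UefiBootServicesTableLib.h>

  extern EFI_GUID gEfiPciEnumerationCompleteProtocolGuid; // via .dec/.inf

  STATIC
  EFI_STATUS
  SketchSignalPciEnumerationComplete (
    VOID
    )
  {
    EFI_HANDLE  Handle;

    Handle = NULL;
    //
    // The interface pointer is NULL on purpose: the protocol carries no
    // functions or data, its mere presence in the protocol database is
    // the "enumeration finished" signal that other drivers can wait for.
    //
    return gBS->InstallProtocolInterface (
                  &Handle,
                  &gEfiPciEnumerationCompleteProtocolGuid,
                  EFI_NATIVE_INTERFACE,
                  NULL
                  );
  }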

(3) OVMF's Acpi Platform Driver, in
OvmfPkg/AcpiPlatformDxe/AcpiPlatformDxe.inf, is dispatched at some point
by the DXE core. This is a DXE_DRIVER, meaning it does stuff immediately
in its entry point.

What it does is the following. It checks for the presence of
gEfiPciEnumerationCompleteProtocolGuid in the protocol database. If that
protocol is available immediately, then the driver knows it has been
dispatched *after* PCI enumeration completed, so it goes with its
business right ahead (see below). If the protocol is not available yet,
then it registers a *protocol notify* callback for the same, and exits
the entry point function. (It stays resident of course.)

When the PCI bus driver finishes enumeration sometime later, and
installs said protocol, the ACPI platform driver receives the
callback, and finally performs its business.

And that business is to download the ACPI payload from QEMU over fw_cfg,
"execute" the linker-loader script, and install the ACPI tables, calling
EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable().
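
(Schematically, the "check, otherwise register a protocol notify" logic
looks like this. The boot services calls are the standard ones, while
OnPciEnumerationComplete() and DownloadAndInstallAcpiTables() are
placeholder names for this sketch.)

  #include <Uefi.h>
  #include <Library/UefiBootServicesTableLib.h>

  extern EFI_GUID gEfiPciEnumerationCompleteProtocolGuid; // via .dec/.inf

  STATIC
  VOID
  DownloadAndInstallAcpiTables (
    VOID
    )
  {
    //
    // Placeholder for the real work: fw_cfg download, linker/loader
    // script execution, EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() calls.
    //
  }

  STATIC
  VOID
  EFIAPI
  OnPciEnumerationComplete (
    IN EFI_EVENT  Event,
    IN VOID       *Context
    )
  {
    DownloadAndInstallAcpiTables ();
    gBS->CloseEvent (Event);
  }

  EFI_STATUS
  EFIAPI
  SketchAcpiPlatformEntryPoint (
    IN EFI_HANDLE        ImageHandle,
    IN EFI_SYSTEM_TABLE  *SystemTable
    )
  {
    EFI_STATUS  Status;
    VOID        *Interface;
    EFI_EVENT   Event;
    VOID        *Registration;

    //
    // Dispatched after PCI enumeration already finished? Go right ahead.
    //
    Status = gBS->LocateProtocol (&gEfiPciEnumerationCompleteProtocolGuid,
                    NULL, &Interface);
    if (!EFI_ERROR (Status)) {
      DownloadAndInstallAcpiTables ();
      return EFI_SUCCESS;
    }

    //
    // Otherwise, wake up when PciBusDxe installs the protocol later on.
    //
    Status = gBS->CreateEvent (EVT_NOTIFY_SIGNAL, TPL_CALLBACK,
                    OnPciEnumerationComplete, NULL, &Event);
    if (EFI_ERROR (Status)) {
      return Status;
    }
    return gBS->RegisterProtocolNotify (
                  &gEfiPciEnumerationCompleteProtocolGuid, Event,
                  &Registration);
  }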

(4) The final piece of the puzzle is the BDS driver (Boot Device
Selection driver). This is also a DXE_DRIVER. The DXE core dispatches it
as the last step of the DXE phase.

This driver is a generic driver (in
IntelFrameworkModulePkg/Universal/BdsDxe), but OVMF customizes it with a
plug-in library instance, which lives in
"OvmfPkg/Library/PlatformBdsLib". The important function in the library
is PlatformBdsPolicyBehavior() -- the generic BDS driver calls it.

In this function, we have:

  VisitAllInstancesOfProtocol (&gEfiPciRootBridgeIoProtocolGuid,
    ConnectRootBridge, NULL);

This looks simple, but it does a whole lot:

(a) It locates all the root bridge IO protocol instances that (1)
    automatically produced, and calls the gBS->ConnectController()
    service on each (see the short sketch after this list). This is what
    *asks* the PCI bus driver to bind each root bridge.

(b) In turn the PCI bus driver does the enumeration and resource
    allocation, communicating with OvmfPkg/PciHostBridgeDxe in the
    process. It produces a bunch of PciIo protocol instances for the
    devices it finds. Finally it installs the "PCI enumeration
    complete" protocol discussed in (2).

    At this point the PCI configuration *should* be final, as far as
    QEMU is concerned.

(c) The ACPI platform driver described in (3) comes to life, and
    downloads the payload via fw_cfg. The payload is regenerated by
    QEMU on the spot, based on the current PCI configuration.
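
(As an aside, the ConnectRootBridge() callback used in (a) is essentially
a one-liner around gBS->ConnectController(). Sketched below with an
invented name and the details trimmed, assuming the (handle, instance,
context) callback shape that VisitAllInstancesOfProtocol() passes.)

  #include <Uefi.h>
  #include <Library/UefiBootServicesTableLib.h>

  //
  // Called once per EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL handle. Connecting
  // the controller with a NULL driver list and Recursive=FALSE is what
  // asks PciBusDxe to bind the root bridge, enumerate it, and allocate
  // resources.
  //
  STATIC
  EFI_STATUS
  EFIAPI
  SketchConnectRootBridge (
    IN EFI_HANDLE  Handle,
    IN VOID        *Instance,
    IN VOID        *Context
    )
  {
    return gBS->ConnectController (Handle, NULL, NULL, FALSE);
  }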

(5) *later*, but still in OvmfPkg/Library/PlatformBdsLib, we call
BdsLibConnectAll(). This is a utility library function (not specific to
OvmfPkg) that binds *all* possible UEFI drivers to *all* possible device
handles.

This is when the specific PCI device drivers I mentioned above (video,
virtio, etc) bind the PciIo protocol instances -- produced in (b) -- and
produce the higher level abstractions (Graphics Output Protocol, Block
IO protocol), on top of which even higher level protocols are produced by
other drivers, recursively. (Disk IO on Block IO, partitions on Disk IO,
filesystems on partitions, for example.)

The point is, the IO and MMIO enable bits, in each individual PCI
device's command register, are set *only* in this stage, by the
individual UEFI drivers that recognize, and drive, that specific kind of
PCI device (most frequently based on vendor and device ID).

For example, a QXL video card sees this occur when the
QemuVideoControllerDriverStart() function, in
"OvmfPkg/QemuVideoDxe/Driver.c", binds the QXL card's PciIo protocol
instance -- search it for "EFI_PCI_DEVICE_ENABLE". (The definition of
that macro is in "MdePkg/Include/Protocol/PciIo.h".)
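
(The enable step itself is a single PciIo call, roughly as below; error
handling and the attribute save/restore that a well behaved driver
performs are omitted, and the function name is invented for the sketch.)

  #include <Uefi.h>
  #include <Protocol/PciIo.h>

  //
  // Turn on I/O, memory and bus master decoding for one bound PCI device;
  // EFI_PCI_DEVICE_ENABLE is the PciIo.h convenience mask covering those
  // attributes.
  //
  STATIC
  EFI_STATUS
  SketchEnablePciDecoding (
    IN EFI_PCI_IO_PROTOCOL  *PciIo
    )
  {
    return PciIo->Attributes (
                    PciIo,
                    EfiPciIoAttributeOperationEnable,
                    EFI_PCI_DEVICE_ENABLE,
                    NULL    // resulting attribute mask not needed
                    );
  }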


Bottom line, the ACPI payload is downloaded after the enumeration /
resource allocation completes, but *not* after each single PCI device is
*enabled* via its own command register.

If the ACPI payload actually depends on *that*, that's trouble. For
example, I could interrupt the boot process, and go to the UEFI shell
first, before OVMF boots "grub" (or Windows). Then I can manually
*unbind* some drivers from some devices.

For example, I can manually unbind QemuVideoDxe from the QXL card. Being
the well behaved UEFI_DRIVER that it is, QemuVideoDxe will restore the
original PCI attributes for this device -- meaning, it will clear the
various decode bits in the command register. If I happen to have a
serial console for the virtual machine as well, then I can easily
continue using the VM over that. I can return to the UEFI setup utility
from the UEFI shell, and allow the boot process to continue -- without a
configured Graphics Output Protocol for the QXL card!

I can do this with any (well written) UEFI_DRIVER module. Therefore, the
decode bits may change for any given PCI device *after* the ACPI payload
was downloaded. I think QEMU should not keep regenerating that ACPI
payload, nor should OVMF attempt to notice any such change and
re-download and re-install the ACPI tables.

Speaking in terms of "UEFI design", whether a PCI device is at the
moment bound & driven by a matching UEFI driver is (or should be)
orthogonal to what the ACPI platform driver does or does not do.


I could try to re-wire the ACPI platform driver, so that it doesn't wake
up when the PCI bus driver completes the enumeration -- see (4)(b) and
(4)(c) above. Instead, I could try to signal it manually from
PlatformBdsLib, *after* the explicit BdsLibConnectAll() -- see (5).

Because at that point, all possible UEFI drivers should be active, and
all recognized PCI devices should be decoding their resources.

However,
- this is incredibly ugly. (Small problem.)
- The big problem is that the ACPI tables, particularly the FACS, are a
  *prerequisite* for a number of other firmware features; most
  importantly: S3.

  There's a bunch of important S3- and SMM-related stuff happening
  *between* the enumeration -- (4)(c) -- and BdsLibConnectAll() -- (5).
  If I wanted to delay the ACPI table installation, that would turn the
  entire logic in-between on its head -- a dependency hell.

  So I don't think I want to do that; it was very hard to get into a
  functional state.


> 2. The same as with pxb: disable Integrated Endpoints for pxb-pcie.

My vote, without a doubt.

> 
> I am going to look at 1.; maybe it is doable in a clean way.

My vote: don't. :)

Thanks
Laszlo

> Thanks,
> Marcel
> 
> 
> [...]



