[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
|
From: |
Juan Quintela |
|
Subject: |
Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility |
|
Date: |
Wed, 17 May 2023 13:43:46 +0200 |
|
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux) |
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, May 15, 2023 at 10:32:01AM +0200, Juan Quintela wrote:
>> When we detect that we have broken backwards compantibility in a
>> released version, we can't do anything for that version. But once we
>> fix that bug on the next released version, we can "mitigate" that
>> problem when migrating to new versions to give a way out of that
>> machine until it does a hard reboot.
>>
>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>> ---
>> docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 194 insertions(+)
>>
>> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
>> index 95e797ee60..97b6f48474 100644
>> --- a/docs/devel/migration.rst
>> +++ b/docs/devel/migration.rst
>> @@ -451,6 +451,200 @@ binary in both sides of the migration. If we use
>> different QEMU
>> versions process, then we need to have into account all other
>> differences and the examples become even more complicated.
>>
>> +How to mitigate when we have a backward compatibility error
>> +-----------------------------------------------------------
>> +
>> +We broke migration for old machine types continously during
>> +development. But as soon as we find that there is a problem, we fix
>> +it. The problem is what happens when we detect after we have done a
>> +release that something has gone wrong.
>> +
>> +Let see how it worked with one example.
>> +
>> +After the release of qemu-8.0 we found a problem when doing migration
>> +of the machine type pc-7.2.
>> +
>> +- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2
>> +
>> + This migration works
>> +
>> +- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2
>> +
>> + This migration works
>> +
>> +- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2
>> +
>> + This migration fails
>> +
>> +- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2
>> +
>> + This migration fails
>> +
>> +So clearly something fails when migration between qemu-7.2 and
>> +qemu-8.0 with machine type pc-7.2. The error messages, and git bisect
>> +pointed to this commit.
>> +
>> +In qemu-8.0 we got this commit: ::
>> +
>> + commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
>> + Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> + Date: Thu Mar 2 13:37:03 2023 +0000
>> +
>> + hw/pci/aer: Add missing routing for AER errors
>> +
>> +The relevant bits of the commit for our example are this ones:
>> +
>> + --- a/hw/pci/pcie_aer.c
>> + +++ b/hw/pci/pcie_aer.c
>> + @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
>> +
>> + pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
>> + PCI_ERR_UNC_SUPPORTED);
>> + + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
>> + + PCI_ERR_UNC_MASK_DEFAULT);
>> + + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
>> + + PCI_ERR_UNC_SUPPORTED);
>> +
>> + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
>> + PCI_ERR_UNC_SEVERITY_DEFAULT);
>> +
>> +The patch changes how we configure pci space for AER. But qemu fails
>> +when the pci space configuration is different betwwen source and
>> +destination.
>> +
>> +The following commit show how this got fixed:
>> +
>> +<put info of the commit once that it arrives upstream>
>> +
>> +The relevant parts of the fix are as follow:
>> +
>> +First, we create a new property for the device to be able to configure
>> +the old behaviour or the new behaviour. ::
>> +
>> + diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> + index 8a87ccc8b0..5153ad63d6 100644
>> + --- a/hw/pci/pci.c
>> + +++ b/hw/pci/pci.c
>> + @@ -79,6 +79,8 @@ static Property pci_props[] = {
>> + DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
>> + failover_pair_id),
>> + DEFINE_PROP_UINT32("acpi-index", PCIDevice, acpi_index, 0),
>> + + DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
>> + + QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
>> + DEFINE_PROP_END_OF_LIST()
>> + };
>> +
>> +Notice that we enable te feature for new machine types.
>> +
>> +Now we see how the fix is done. This is going to depend on what kind
>> +of breakage happens, but in this case it is quite simple. ::
>> +
>> + diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
>> + index 103667c368..374d593ead 100644
>> + --- a/hw/pci/pcie_aer.c
>> + +++ b/hw/pci/pcie_aer.c
>> + @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
>> + uint16_t offset,
>> +
>> + pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
>> + PCI_ERR_UNC_SUPPORTED);
>> + - pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
>> + - PCI_ERR_UNC_MASK_DEFAULT);
>> + - pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
>> + - PCI_ERR_UNC_SUPPORTED);
>> + +
>> + + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
>> + + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
>> + + PCI_ERR_UNC_MASK_DEFAULT);
>> + + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
>> + + PCI_ERR_UNC_SUPPORTED);
>> + + }
>> +
>> + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
>> + PCI_ERR_UNC_SEVERITY_DEFAULT);
>> +
>> +I.e. If the property bit is enabled, we configure it as we did for
>> +qemu-8.0. If the property bit is not set, we configure it as it was in 7.2.
>> +
>> +And now, everything that is missing is disable the feature for old
>> +machine types: ::
>> +
>> + diff --git a/hw/core/machine.c b/hw/core/machine.c
>> + index 47a34841a5..07f763eb2e 100644
>> + --- a/hw/core/machine.c
>> + +++ b/hw/core/machine.c
>> + @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
>> + { "e1000e", "migrate-timadj", "off" },
>> + { "virtio-mem", "x-early-migration", "false" },
>> + { "migration", "x-preempt-pre-7-2", "true" },
>> + + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
>> + };
>> + const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
>> +
>> +And now, when qemu-8.0.1 is released with this fix, all combinations
>> +are going to work as supposed.
>> +
>> +- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
>> +- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)
>> +- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
>> +- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)
>> +
>> +So the normality has been restaured and everything is ok, no?
>> +
>> +Not really, now our matrix is much bigger. We started with the easy
>> +cases, migration from the same version to the same version always
>> +works:
>> +
>> +- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2
>> +- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2
>> +- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
>> +
>> +Now the interesting ones. When the QEMU processes versions are
>> +different. For the 1st set, their fail and we can do nothing, both
>> +versions are relased and we can't change anything.
>> +
>> +- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2
>> +- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2
>> +
>> +This two are the ones that work. The whole point of making the
>> +change in qemu-8.0.1 release was to fix this issue:
>> +
>> +- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
>> +- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2
>> +
>> +But now we found that qemu-8.0 neither can migrate to qemu-7.2 not
>> +qemu-8.0.1.
>> +
>> +- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
>> +- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0 -M pc-7.2
>> +
>> +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
>> +anything except to qemu-8.0.
>> +
>> +Can we do better?
>> +
>> +Yeap. If we know that we are gonig to do this migration:
>> +
>> +- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
>> +
>> +We can launche the appropiate devices with
>> +
>> +--device...,x-pci-e-err-unc-mask=on
>> +
>> +And now we can receive a migration from 8.0. And from now on, we can
>> +do that migration to new machine types if we remember to enable that
>> +property for pc-7.2. Notice that we need to remember, it is not
>> +enough to know that the source of the migration is qemu-8.0. Think of this
>> example:
>> +
>> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
>> +
>> +In the second migration, the source is not qemu-8.0, but we still have
>> +that "problem" and have that property enabled. Notice that we need to
>> +continue having this mark/property until we have this machine
>> +rebooted. But it is not a normal reboot (that don't reload qemu) we
>> +need the mapchine to poweroff/poweron on a fixed qemu. And from now
>> +on we can use the proper real machine.
>> +
>> VMState
>> -------
>
> Can we release this list of things that need to be configured
> somewhere? Maybe in a sane format that libvirt can parse?
What do you mean here?
the x-pci-e-err-unc-mask=on?
The most similar thing that we have is pc/machine.c:hw_compat_x_y.
But that also include the things where we have done the things right.
Daniel, Jiri, what would you need and what would be useful to you?
Later, Juan.
- [PATCH v3 1/3] migration: Add documentation for backwards compatiblity, (continued)