qemu-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-commits] [qemu/qemu] d522cb: spapr: rollback 'unplug timeout' for


From: Peter Maydell
Subject: [Qemu-commits] [qemu/qemu] d522cb: spapr: rollback 'unplug timeout' for CPU hotunplugs
Date: Tue, 13 Apr 2021 05:10:43 -0700

  Branch: refs/heads/staging
  Home:   https://github.com/qemu/qemu
  Commit: d522cb52e604c8448674d90dae09ad50223b5d3c
      
https://github.com/qemu/qemu/commit/d522cb52e604c8448674d90dae09ad50223b5d3c
  Author: Daniel Henrique Barboza <danielhb413@gmail.com>
  Date:   2021-04-12 (Mon, 12 Apr 2021)

  Changed paths:
    M hw/ppc/spapr.c
    M hw/ppc/spapr_drc.c
    M include/hw/ppc/spapr_drc.h
    M include/qemu/timer.h
    M util/qemu-timer.c

  Log Message:
  -----------
  spapr: rollback 'unplug timeout' for CPU hotunplugs

The pseries machines introduced the concept of 'unplug timeout' for CPU
hotunplugs. The idea was to circunvent a deficiency in the pSeries
specification (PAPR), that currently does not define a proper way for
the hotunplug to fail. If the guest refuses to release the CPU (see [1]
for an example) there is no way for QEMU to detect the failure.

Further discussions about how to send a QAPI event to inform about the
hotunplug timeout [2] exposed problems that weren't predicted back when
the idea was developed. Other QEMU machines don't have any type of
hotunplug timeout mechanism for any device, e.g. ACPI based machines
have a way to make hotunplug errors visible to the hypervisor. This
would make this timeout mechanism exclusive to pSeries, which is not
ideal.

The real problem is that a QAPI event that reports hotunplug timeouts
puts the management layer (namely Libvirt) in a weird spot. We're not
telling that the hotunplug failed, because we can't be 100% sure of
that, and yet we're resetting the unplug state back, preventing any
DEVICE_DEL events to reach out in case the guest decides to release the
device. Libvirt would need to inspect the guest itself to see if the
device was released or not, otherwise the internal domain states will be
inconsistent.  Moreover, Libvirt already has an 'unplug timeout'
concept, and a QEMU side timeout would need to be juggled together with
the existing Libvirt timeout.

All this considered, this solution ended up creating more trouble than
it solved. This patch reverts the 3 commits that introduced the timeout
mechanism for CPU hotplugs in pSeries machines.

This reverts commit 4515a5f786024fabf0bef4cf3d28adf5647e6e82
"qemu_timer.c: add timer_deadline_ms() helper"

This reverts commit d1c2e3ce3d5a5424651967bce1cf1f4caa0c6d91
"spapr_drc.c: add hotunplug timeout for CPUs"

This reverts commit 51254ffb320183a4636635840c23ee0e3a1efffa
"spapr_drc.c: introduce unplug_timeout_timer"

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1911414
[2] https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg04682.html

CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20210401000437.131140-2-danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>


  Commit: 2b18fc794f312a91482998bae5ea6c8724200e06
      
https://github.com/qemu/qemu/commit/2b18fc794f312a91482998bae5ea6c8724200e06
  Author: Daniel Henrique Barboza <danielhb413@gmail.com>
  Date:   2021-04-12 (Mon, 12 Apr 2021)

  Changed paths:
    M hw/ppc/spapr.c

  Log Message:
  -----------
  spapr.c: always pulse guest IRQ in spapr_core_unplug_request()

Commit 47c8c915b162 fixed a problem where multiple spapr_drc_detach()
requests were breaking QEMU. The solution was to just spapr_drc_detach()
once, and use spapr_drc_unplug_requested() to filter whether we already
detached it or not. The commit also tied the hotplug request to the
guest in the same condition.

Turns out that there is a reliable way for a CPU hotunplug to fail. If a
guest with one CPU hotplugs a CPU1, then offline CPU0s via 'echo 0 >
/sys/devices/system/cpu/cpu0/online', then attempts to hotunplug CPU1,
the kernel will refuse it because it's the last online CPU of the
system. Given that we're pulsing the IRQ only in the first try, in a
failed attempt, all other CPU1 hotunplug attempts will fail, regardless
of the online state of CPU1 in the kernel, because we're simply not
letting the guest know that we want to hotunplug the device.

Let's move spapr_hotplug_req_remove_by_index() back out of the "if
(!spapr_drc_unplug_requested(drc))" conditional, allowing for multiple
'device_del' requests to the same CPU core to reach the guest, in case
the CPU core didn't fully hotunplugged previously.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20210401000437.131140-3-danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>


  Commit: dce628a97fde2594f99d738883a157f05aa0a14f
      
https://github.com/qemu/qemu/commit/dce628a97fde2594f99d738883a157f05aa0a14f
  Author: Peter Maydell <peter.maydell@linaro.org>
  Date:   2021-04-13 (Tue, 13 Apr 2021)

  Changed paths:
    M hw/ppc/spapr.c
    M hw/ppc/spapr_drc.c
    M include/hw/ppc/spapr_drc.h
    M include/qemu/timer.h
    M util/qemu-timer.c

  Log Message:
  -----------
  Merge remote-tracking branch 'remotes/dg-gitlab/tags/ppc-for-6.0-20210412' 
into staging

ppc patch queue for 2021-04-21

Here's what I hope is the last ppc related pull request for qemu-6.0.

The 2 patches here revert a behavioural change that after further
discussion we concluded was a bad idea (adding a timeout for
possibly-failed hot unplug requests).  Instead it implements a
different approach to the original problem: we again let unplug
requests the guest doesn't respond to remain pending indefinitely, but
no longer allow those to block attempts to retry the same unplug
again.

The change is a bit more complex than I'd like for this late in the
freeze.  Nonetheless, I think it's important to merge this for 6.0, so
we don't allow a release which has the probably-a-bad-idea timeout
behaviour.

# gpg: Signature made Mon 12 Apr 2021 06:25:58 BST
# gpg:                using RSA key 75F46586AE61A66CC44E87DC6C38CACA20D9B392
# gpg: Good signature from "David Gibson <david@gibson.dropbear.id.au>" [full]
# gpg:                 aka "David Gibson (Red Hat) <dgibson@redhat.com>" [full]
# gpg:                 aka "David Gibson (ozlabs.org) <dgibson@ozlabs.org>" 
[full]
# gpg:                 aka "David Gibson (kernel.org) <dwg@kernel.org>" 
[unknown]
# Primary key fingerprint: 75F4 6586 AE61 A66C C44E  87DC 6C38 CACA 20D9 B392

* remotes/dg-gitlab/tags/ppc-for-6.0-20210412:
  spapr.c: always pulse guest IRQ in spapr_core_unplug_request()
  spapr: rollback 'unplug timeout' for CPU hotunplugs

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>


Compare: https://github.com/qemu/qemu/compare/1a66dab9ddab...dce628a97fde



reply via email to

[Prev in Thread] Current Thread [Next in Thread]