[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC] adding a generic QAPI event for failed device hotunplug

From: David Gibson
Subject: Re: [RFC] adding a generic QAPI event for failed device hotunplug
Date: Mon, 29 Mar 2021 16:35:12 +1100

On Tue, Mar 23, 2021 at 02:06:36PM +0100, Igor Mammedov wrote:
> On Tue, 23 Mar 2021 14:33:28 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> > On Mon, Mar 22, 2021 at 01:06:53PM +0100, Paolo Bonzini wrote:
> > > On 22/03/21 07:39, David Gibson wrote:  
> > > > > QEMU doesn't really keep track of "in flight" unplug requests, and as
> > > > > long as that's the case, its timeout even will have the same issue.  
> > > > Not generically, maybe.  In the PAPR code we effectively do, by means
> > > > of the 'unplug_requested' boolean in the DRC structure.  Maybe that's
> > > > a mistake, but at the time I couldn't see how else to handle things.  
> > > 
> > > No, that's good.  x86 also tracks it in some registers that are accessible
> > > from the ACPI firmware.  See "PCI slot removal notification" in
> > > docs/specs/acpi_pci_hotplug.txt.
> > >   
> > > > Currently we will resolve all "in flight" requests at machine reset
> > > > time, effectively completing those requests.  Does that differ from
> > > > x86 behaviour?  
> > > 
> > > IIRC on x86 the requests are instead cancelled, but I'm not 100%
> > > sure.  
> > 
> > Ah... we'd better check that and try to make ppc consistent with
> > whatever it does.
> > 
> Sorry for being late to discussion, I can't say much for all possible ways to 
> unplug
> PCI device (aside that it's a complicated mess), but hopefully I can shed 
> some light on
> state/behavior of ACPI based methods.

Thanks, this is quite useful.

> * x86 - ACPI based PCI hotplug
>  Its sole existence was dictated by Widows not supporting SHPC (conventional 
> PCI),
>  and it looks like 'thanks' to Windows buggy drivers we would have to use it 
> for
>  PCI-E  as well (Julia works on it).

AIUI, Julia et al. are working on ACPI PCI hotplug, but it's not yet
merged.  Is that correct?

>  HW registers described in docs/specs/acpi_pci_hotplug.txt are our own 
> invention,
>  they help to raise standard ACPI 'device check' and 'eject request' events 
> when
>  guest executes AML bytecode. Potentially there is possibility for guest to 
> report
>  plug/unplug progress via ACPI _OST method (including failure/completion) but 
> given
>  my experience with how Windows PCI core worked so far that may be not used 
> by it
>  (honestly I haven't tried to explore possibility, due to lack of interest in 
> it).
>  regarding unplug - on device_del QEMU raises SCI interrupt, after this the 
> process is
>  asynchronous. When ACPI interpreter gets SCI it sends a respective _EJ0 
> event to
>  devices mentioned in PCI_DOWN_BASE register. After getting the event, guest 
> OS may

Ok.  Is PCI_DOWN_BASE an actual emulated hardware register, or one of
the invented ones you mention above?

Either way, I'm assuming there must be a PCI_DOWN_BASE register for
each PCI bus, yes?  How is that implemented for PCI to PCI bridges?

>  decide to eject PCI device (which causes clearing of device's bit in 
>  or refuse to do it. There is no any progress tracking in QEMU for failure 
> and device's
>  bit in PCI_DOWN_BASE is kept set. On the next device_(add|del) (for any PCI 
> device)
>  guest will see it again and will retry removal.

Ok.  Sounds like the bits in PCI_DOWN_BASE are roughly equivalent to
our 'unplug_requested' flag.  In our case it's not guest visible on a
continuing basis, though.  For PAPR hotplug, we have an explicit event
queue, and each event just pertains to a specific device (or cpu, or
memory block).

>  Also if guest reboots with any bits in PCI_DOWN_BASE set, respective devices 
> will
>  be deleted on QEMU side.

Ok, that's the same as how we behave: on reset anything with
'unplug_requested' will be deleted.

>  There is no other way to cancel removal request in PCI_DOWN_BASE, aside of 
> explicitly
>  ejecting device on guest request or implicitly on reboot.
>  IMHO:
>      Sticky nature of PCI_(UP|DOWN)_BASE is more trouble than help but its 
> there since
>      SeaBios times so it's ABI we are stuck with. If I were re-implementing 
> it now,
>      I would use one shot event that's cleared once guest read it and if 
> possible
>      implement _OST status reporting (if it works on Windows).
>  As it stands now, once device_del is issued one user can't know when PCI 
> device will be
>  removed. No timeout will help with it.

Ok.  What happens if you retry a device_del on the same device,
because the guest hasn't responded to the first one?

> * ACPI CPU/Memory hotplug
>  Events triggered by device_del are one shot, then guest may report progress 
> to QEMU using
>  _OST method (qapi_event_send_acpi_device_ost) (I know that libvirt were 
> aware of it,
>  but I don't recall what it does with it). So QEMU might send '_UNPLUG_ERROR' 
> event to
>  user if guest decides so. But instead of duplicating all possible events 
> from spec
>  QEMU will pass _OST arguments [1] as is for user to interpret as described 
> by standard.
>  Though I'd say _OST is not 100% reliable, depending used Windows or linux 
> kernel version
>  they might skip on reporting some events. But I don't recall exact state at 
> the time I've
>  been testing it. So I'd call status reporting support as 'best effort'.

Right.  That more or less has to be the case for anything guest
reported.  Even if our guest kernel is supposed to have support for
reporting back, it might fail to due to a bug or unrelated hard freeze.

>  Also it doesn't feature pending removal on reboot, that our ACPI PCI hotplug 
> code has.

Oh, that's interesting.  On PAPR we do have pending removal on reboot
for all hotplug types.  I guess if the behaviour isn't even consistent
within x86 variants, it doesn't matter very much if PAPR is consistent
with x86.

>  So with well behaving guest user will get notified about failure or device 
> removal (when
>  guest is able to run its code), for broken guests I'm more inclined to say 
> 'use fixed guest'
>  to get sane behavior.
>  Policy for user is to retry on failure (there is no bad side effects on 
> retry).

Ok.  That's different for us - we will fail attempts to retry in qemu,
which I now strongly suspect is a mistake.  I think that shouldn't be
too complicated to change, though.

> I think that any kind of timeout here is inherently racy, in async 
> hot[un]plug usecase,
> all user has to do is just sufficiently over-commit host (or run it
> nested).

Ah, I see your point.  No matter how highly the guest prioritizes
handling the unplug event, a slow-running host could mean that it
doesn't complete the handling in time.

Yeah, that's a pretty strong argument against any sort of timeout.

> So it's just a question of how long it will take for user to come back with a 
> bug report. 
> * As far as I'm aware mentioned 'pending_deleted_event' is there to make 
> transparent
>   failover magic happen (CCing Jens, also Michael might know how it works)
> * SHCP & PCI-E has its own set of unplug quirks, which I know little about 
> but Julia worked
>   with Michael on fixing PCI-E bugs (mostly related how Windows drivers 
> handle unplug,
>   some are not possible to fix, hence decision to add ACPI based hotplug to 
> Q35 as workaround).
>   So they might know specifics.

Ok.  I know I've run into a bunch of complications with SHPC and
pciehp for unrelated work (Kata & SR-IOV).  Julia and/or Michael, if
you're able to answer for SHPC and pcieh these two questions, that
would be pretty useful:

   a) If you device_del, but the guest doesn't respond to the request
      at all, what happens?  If you reboot while in that state, does
      the unplug get completed, or cancelled?

   b) Can you safely retry a device_del which is "pending"
      (i.e. issued by qemu, but the guest hasn't responded yet)

> 1) ACPI spec: _OST (OSPM Status Indication)

David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!

Attachment: signature.asc
Description: PGP signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]