Re: [Qemu-devel] [patch v6 11/12] vfio: register aer resume notification
From: Alex Williamson
Subject: Re: [Qemu-devel] [patch v6 11/12] vfio: register aer resume notification handler for aer resume
Date: Wed, 11 May 2016 14:20:28 -0600
On Wed, 11 May 2016 11:11:39 +0800
Zhou Jie <address@hidden> wrote:
> Hi, Alex
> What do you think about the following solution?
> 1. Detect support for resume notification.
> If the host vfio driver does not have resume notifier flags,
> directly fail to boot the VM when aer is enabled.
It's not a flag used to detect the resume notifier, but simply probing
IRQ_INFO for the index allocated for this notification.
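
A minimal sketch of that probe in C, for illustration only. It is not QEMU code: `struct host_irqs` and `resume_irq_supported` are hypothetical names, and the resume index is whatever the patch series ends up allocating. The struct stands in for what the host reports, that is, `num_irqs` from `vfio_device_info` and the per-index `count` from `vfio_irq_info`; real code would fill these in through the vfio ioctls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of what the host vfio driver reports. */
struct host_irqs {
    uint32_t num_irqs;       /* as in vfio_device_info.num_irqs */
    const uint32_t *counts;  /* as in vfio_irq_info.count, one entry per index */
};

/*
 * Return true if the host exposes an interrupt at resume_index.
 * resume_index is an assumption: the index allocated for the
 * resume notification by the patch under discussion.
 */
bool resume_irq_supported(const struct host_irqs *host, uint32_t resume_index)
{
    assert(host != NULL);

    /* An older host does not know the index at all. */
    if (resume_index >= host->num_irqs) {
        return false;
    }
    /* The index exists; it is only usable if it reports an interrupt. */
    return host->counts[resume_index] >= 1;
}
```

No dedicated flag is needed in this scheme: a host that predates the resume notification reports fewer IRQ indexes, so the range check alone rejects it.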
> 2. Immediately notify the VM on error detected.
> 3. Stall any access to the device until resume is signaled.
> Disable mmaps, drop writes, return -1 for reads.
> 4. Delay the guest directed bus reset.
> Don't reset bus in vfio_pci_reset function.
> 5. Wait for resume notification.
> If we don't get the resume notification from the host after
> some timeout, we would abort the guest directed bus reset
> altogether and make the device disappear,
> Initiating an unplug of the device to prevent it from further
> interacting with the VM.
> 6. After getting the resume notification,
> reset the bus.
> This is the second bus reset, because the host has already done one.
> But as you said we shouldn't necessarily design the API that
> strictly around the current behavior of the Linux AER handler.
Otherwise it sounds like what I had proposed. Thanks,
Alex
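
The plan discussed above (notify on error, stall access, delay the guest-directed reset, give up after a timeout) can be sketched as a small state machine in C. This is an illustration under stated assumptions, not the QEMU implementation: every name here (`struct aer_dev`, `aer_error_detected`, `aer_guest_reset`, and so on) is hypothetical, the timeout is assumed to start when the guest requests the bus reset, and the unplug is modeled only as a terminal state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device states for the recovery flow. */
enum aer_state { AER_NORMAL, AER_STALLED, AER_UNPLUGGED };

struct aer_dev {
    enum aer_state state;
    bool reset_pending;    /* guest asked for a bus reset while stalled */
    uint64_t deadline_ms;  /* when to give up waiting for resume */
    unsigned resets_done;  /* bus resets actually performed for the guest */
};

/* Error notification from the host: stall access (the AER would be
 * injected into the guest at this point). */
void aer_error_detected(struct aer_dev *d)
{
    assert(d != NULL);
    if (d->state == AER_NORMAL) {
        d->state = AER_STALLED;
    }
}

/* Reads return all ones (-1) unless the device is in normal operation. */
uint32_t aer_read(const struct aer_dev *d, uint32_t device_value)
{
    return d->state == AER_NORMAL ? device_value : UINT32_MAX;
}

/* Writes are dropped unless the device is in normal operation. */
bool aer_write_allowed(const struct aer_dev *d)
{
    return d->state == AER_NORMAL;
}

/* Guest-directed bus reset: performed at once in normal operation,
 * deferred while stalled. The timeout starts here (an assumption). */
void aer_guest_reset(struct aer_dev *d, uint64_t now_ms, uint64_t timeout_ms)
{
    if (d->state == AER_NORMAL) {
        d->resets_done++;
    } else if (d->state == AER_STALLED && !d->reset_pending) {
        d->reset_pending = true;
        d->deadline_ms = now_ms + timeout_ms;
    }
}

/* Resume notification from the host: re-enable access and carry out
 * the deferred reset, if the guest asked for one. */
void aer_resume(struct aer_dev *d)
{
    if (d->state != AER_STALLED) {
        return;
    }
    d->state = AER_NORMAL;
    if (d->reset_pending) {
        d->reset_pending = false;
        d->resets_done++;
    }
}

/* Periodic check: abort the deferred reset and drop the device if the
 * resume notification never arrived. */
void aer_tick(struct aer_dev *d, uint64_t now_ms)
{
    if (d->state == AER_STALLED && d->reset_pending &&
        now_ms >= d->deadline_ms) {
        d->reset_pending = false;
        d->state = AER_UNPLUGGED;
    }
}
```

In this sketch the deferred reset is always carried out after resume, which matches the point that the guest reset may duplicate one already done by the host.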
> On 2016/5/7 0:39, Alex Williamson wrote:
> > On Fri, 6 May 2016 09:38:41 +0800
> > Chen Fan <address@hidden> wrote:
> >
> >> On 04/26/2016 10:48 PM, Alex Williamson wrote:
> >>> On Tue, 26 Apr 2016 11:39:02 +0800
> >>> Chen Fan<address@hidden> wrote:
> >>>
> >>>> On 04/14/2016 09:02 AM, Chen Fan wrote:
> >>>>> On 04/12/2016 05:38 AM, Alex Williamson wrote:
> >>>>>> On Tue, 5 Apr 2016 19:42:02 +0800
> >>>>>> Cao jin<address@hidden> wrote:
> >>>>>>
> >>>>>>> From: Chen Fan<address@hidden>
> >>>>>>>
> >>>>>>> for supporting aer recovery, host and guest would run the same aer
> >>>>>>> recovery code, that would do the secondary bus reset if the error
> >>>>>>> is fatal, the aer recovery process:
> >>>>>>> 1. error_detected
> >>>>>>> 2. reset_link (if fatal)
> >>>>>>> 3. slot_reset/mmio_enabled
> >>>>>>> 4. resume
> >>>>>>>
> >>>>>>> it indicates that host will do secondary bus reset to reset
> >>>>>>> the physical devices under bus in step 2, that would cause
> >>>>>>> devices in D3 status in a short time. but in qemu, we register
> >>>>>>> an error detected handler, that would be invoked as host broadcasts
> >>>>>>> the error-detected event in step 1, in order to avoid guest do
> >>>>>>> reset_link when host do reset_link simultaneously. it may cause
> >>>>>>> fatal error. we introduce a resume notifier to assure host reset
> >>>>>>> completely. then do guest aer injection.
> >>>>>> Why is it safe to continue running the VM between the error detected
> >>>>>> notification and the resume notification? We're just pushing back the
> >>>>>> point at which we inject the AER into the guest, potentially negating
> >>>>>> any benefit by allowing the VM to consume bad data. Shouldn't we
> >>>>>> instead be immediately notifying the VM on error detected, but stalling
> >>>>>> any access to the device until resume is signaled? How do we know that
> >>>>>> resume will ever be signaled? We have both the problem that we may be
> >>>>>> running on an older kernel that won't support a resume notification and
> >>>>>> the problem that seeing a resume notification depends on the host being
> >>>>>> able to successfully complete a link reset after fatal error. We can
> >>>>>> detect support for resume notification, but we still need a strategy
> >>>>>> for never receiving it. Thanks,
> >>>>> That makes sense, but I haven't come up with a good idea. Do you have
> >>>>> any idea, Alex?
> >>> I don't know that there are any good solutions here. We need to
> >>> respond to the current error notifier interrupt and not regress from
> >>> our support there. I think that means that if we want to switch from a
> >>> simple halt-on-error to a mechanism for the guest to handle recovery,
> >>> we need to disable access to the device between being notified that the
> >>> error occurred and being notified to resume. We can do that by
> >>> disabling mmaps to the device and preventing access via the slow path
> >>> handlers. I don't know what the best solution is for preventing access,
> >>> do we block and pause the VM or do we drop writes and return -1 for
> >>> reads, that's something that needs to be determined. We also need to
> >>> inject the AER into the VM at the point we're notified of an error
> >>> because the VM needs to know as soon as possible to stop using the
> >>> device or trusting any data from it. The next coordination point would
> >>> be something like the resume notifier that you've added and there are
> >>> numerous questions around the interaction of that with the guest
> >>> handling. Clearly we can't do a guest directed bus reset until we get
> >>> the resume notifier, so do we block that execution path in QEMU until
> >>> the resume notification is received? What happens if we don't get that
> >>> notification? Is there any way that we can rely on the host having
> >>> done a bus reset to the point where we don't need to act on the guest
> >>> directed reset? These are all things that need to be figured out.
> >>> Thanks,
> >> Maybe we can simply pause the vcpus and prevent the VM from
> >> accessing the device, and add two flags in VFIO_DEVICE_GET_INFO to query
> >> whether the vfio pci driver has a resume notifier.
> >> If it does not have the resume notifier flags, we can directly fail to boot
> >> the VM
> >> when aer is enabled.
> >
> > We can already tell if a resume interrupt is supported between the IRQ
> > count in vfio_device_info and a probe with vfio_irq_info, what would
> > additional flags in vfio_device_info tell us beyond a resume interrupt
> > being supported? Is pausing the VM acceptable from a service guarantee
> > perspective to users? A bus reset can take a full second and I imagine
> > deeper PCI hierarchies can push that out depending on what level the
> > error occurs. A second of downtime may be enough to trigger failovers
> > to other systems. If we were to disable mmaps when a fault occurs, we
> > could trap any further device access, drop writes, return -1 for
> > reads. This seems reasonable since we've already notified the VM that
> > the device had a fault. The synchronization point seems like when the
> > guest tries to do a bus reset, we need to block that until we get the
> > resume notification from the host. Perhaps if that doesn't occur after
> > some timeout, we would abort the guest directed bus reset altogether
> > and make the device disappear, perhaps even initiating an unplug of the
> > device to prevent it from further interacting with the VM.
> >
> >> Otherwise, we should wait for the resume notifier before
> >> restarting the cpu. As for the problem of the duplicated bus reset by host
> >> and guest,
> >> I think qemu can decide, according to whether the error is fatal or non-fatal,
> >> whether it needs
> >> to do a bus reset on the guest. I think it's not critical and could be
> >> resolved later.
> >
> > The vfio error interrupt doesn't signal non-fatal errors afaik. I'm
> > also not sure we have a guarantee that the host has performed a bus
> > reset, we shouldn't necessarily design the API that strictly around the
> > current behavior of the Linux AER handler. So I don't know that
> > there's any practical way to avoid duplicate bus resets between host
> > and guest recovery. Thanks,
> >
> > Alex