Re: [Qemu-devel] [PATCH v5 5/7] vfio-pci: pass the aer error to guest


From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH v5 5/7] vfio-pci: pass the aer error to guest
Date: Tue, 24 Mar 2015 20:31:03 -0600

On Wed, 2015-03-25 at 09:33 +0800, Chen Fan wrote:
> On 03/16/2015 10:09 PM, Alex Williamson wrote:
> > On Mon, 2015-03-16 at 15:35 +0800, Chen Fan wrote:
> >> On 03/16/2015 11:52 AM, Alex Williamson wrote:
> >>> On Mon, 2015-03-16 at 11:05 +0800, Chen Fan wrote:
> >>>> On 03/14/2015 06:34 AM, Alex Williamson wrote:
> >>>>> On Thu, 2015-03-12 at 18:23 +0800, Chen Fan wrote:
> >>>>>> when the vfio device encounters an uncorrectable error in the host,
> >>>>>> the vfio_pci driver will signal the eventfd registered by this
> >>>>>> vfio device, which results in the qemu eventfd handler getting
> >>>>>> invoked.
> >>>>>>
> >>>>>> this patch is to pass the error to guest and have the guest driver
> >>>>>> recover from the error.
> >>>>> What is going to be the typical recovery mechanism for the guest?  I'm
> >>>>> concerned that the topology of the device in the guest doesn't
> >>>>> necessarily match the topology of the device in the host, so if the
> >>>>> guest were to attempt a bus reset to recover a device, for instance,
> >>>>> what happens?
> >>>> the recovery mechanism is that when the guest gets an AER error from a
> >>>> device, the guest clears the corresponding status bit in the device
> >>>> register. For devices that need a reset, the guest AER driver resets
> >>>> all devices under the bus.
> >>> Sorry, I'm still confused, how does the guest aer driver reset all
> >>> devices under a bus?  Are we talking about function-level, device
> >>> specific reset mechanisms or secondary bus resets?  If the guest is
> >>> performing secondary bus resets, what guarantee do they have that it
> >>> will translate to a physical secondary bus reset?  vfio may only do an
> >>> FLR when the bus is reset or it may not be able to do anything depending
> >>> on the available function-level resets and physical and virtual topology
> >>> of the device.  Thanks,
> >> in general, recovery of a function depends on the behavior of the
> >> corresponding device driver, e.g. whether it implements the
> >> error_detected and slot_reset callbacks.
> >> For a link reset, it usually does a secondary bus reset.
> >>
> >> Must a bus reset of a vfio device always be carried out as a physical
> >> secondary bus reset?
> > That depends on how the guest driver attempts recovery, doesn't it?
> > There are only a very limited number of cases where a secondary bus
> > reset initiated by the guest will translate to a secondary bus reset of
> > the physical device (iirc, single function device without FLR).  In most
> > cases, it will at best be translated to an FLR.  VFIO really only does
> > bus resets on VM reset because that's the only time we know that it's ok
> > to reset multiple devices.  If the guest driver is depending on a
> > secondary bus reset to put the device into a recoverable state and we're
> > not able to provide that, then we're actually reducing containment of
> > the error by exposing AER to the guest and allowing it to attempt
> > recovery.  So in practice, I'm afraid we're risking the integrity of the
> > VM by exposing AER to the guest and making it think that it can perform
> > recovery operations that are not effective.  Thanks,
> Hi Alex,
> 
>      if the guest driver needs to reset a vfio device by secondary bus
> reset when an AER error occurs, how about keeping the behavior of stopping
> the VM and reporting a fatal error to the user?

That sounds like a very fragile heuristic to try to associate the reason
for a secondary bus reset based on the timing of an AER notification.
How can we be sure there's an association?  Is it still worthwhile to
allow the guest to participate in recovery, or will most cases just
delay the VM stop until a bus reset is attempted?  Thanks,

Alex
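
For readers following the thread, the reset translation described in the
quoted text above can be sketched roughly as below. This is only an
illustration under stated assumptions, not QEMU or vfio code: the enum,
the struct, and the function name are all hypothetical, other
device-specific reset methods are ignored, and the "single function
device without FLR" case is taken from the "iirc" remark in the thread
rather than from the sources.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is not QEMU or vfio code. */
enum reset_outcome {
    RESET_NONE, /* nothing usable: the device is not actually reset */
    RESET_FLR,  /* function-level reset of the one assigned function */
    RESET_BUS,  /* physical secondary bus reset */
};

struct assigned_dev {
    bool has_flr;         /* function supports FLR */
    bool single_function; /* only function behind its physical bus */
};

/*
 * What a bus reset requested for an assigned device turns into on the
 * host, following the description in the thread. vm_reset is true only
 * when the whole VM is being reset, the one time it is known to be ok
 * to reset multiple devices.
 */
static enum reset_outcome
reset_on_guest_bus_reset(const struct assigned_dev *dev, bool vm_reset)
{
    if (vm_reset) {
        return RESET_BUS;
    }
    if (dev->has_flr) {
        return RESET_FLR;  /* "at best" an FLR in most cases */
    }
    if (dev->single_function) {
        return RESET_BUS;  /* the limited case recalled in the thread */
    }
    return RESET_NONE;     /* guest believes it reset the device; it did not */
}
```

The RESET_NONE and RESET_FLR outcomes are the cases of concern in the
thread: the guest driver may be relying on a secondary bus reset to reach
a recoverable state, while the physical device sees a weaker reset or
none at all.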




