qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v8 17/20] multi-process: heartbeat messages to remote


From: Elena Ufimtseva
Subject: Re: [PATCH v8 17/20] multi-process: heartbeat messages to remote
Date: Fri, 14 Aug 2020 16:01:47 -0700

On Tue, Aug 11, 2020 at 03:41:30PM +0100, Stefan Hajnoczi wrote:
> On Fri, Jul 31, 2020 at 02:20:24PM -0400, Jagannathan Raman wrote:
> > @@ -343,3 +349,49 @@ static void probe_pci_info(PCIDevice *dev, Error 
> > **errp)
> >          }
> >      }
> >  }
> > +
> > +static void hb_msg(PCIProxyDev *dev)
> > +{
> > +    DeviceState *ds = DEVICE(dev);
> > +    Error *local_err = NULL;
> > +    MPQemuMsg msg = { 0 };
> > +
> > +    msg.cmd = PROXY_PING;
> > +    msg.bytestream = 0;
> > +    msg.size = 0;
> > +
> > +    (void)mpqemu_msg_send_and_await_reply(&msg, dev->ioc, &local_err);
> > +    if (local_err) {
> > +        error_report_err(local_err);
> > +        qio_channel_close(dev->ioc, &local_err);
> > +        error_setg(&error_fatal, "Lost contact with device %s", ds->id);
> > +    }
> > +}
> 
> Here is my feedback from the last revision. Was this addressed?
>

Hi Stefan,

Thank you for reviewing the patchset. In this version we decided to 
shutdown the guest when the heartbeat did not get a reply from the
remote by setting the error_fatal.
Should we approach it differently or you prefer us to get rid of the
heartbeat in this form?

Thank you,
Elena

>   This patch seems incomplete since no action is taken when the device
>   fails to respond. vCPU threads that access the device will still get
>   stuck.
> 
>   The simplest way to make this useful is to close the connection when a
>   timeout occurs. Then the G_IO_HUP handler for the UNIX domain socket
>   should perform connection cleanup. At that point there are a few
>   choices:
> 
>   1. Stop guest execution and wait for the host admin to restore the
>      mplink so execution can resume. This is similar to how -drive
>      rerror=stop pauses the guest when a disk I/O error is encountered.
> 
>   2. Stop guest execution but defer it until this stale device is actually
>      accessed. This maximizes guest uptime. Guests that rarely access the
>      device may not notice at all.
> 
>   3. Return 0 from MemoryRegion read operations and ignore writes. The
>      guest continues executing but the device is broken. This is risky
>      because device drivers inside the guest may not be ready to deal with
>      this. The result could be data loss or corruption.
> 
>   4. Raise a bus-level event. Maybe PCI error reporting can be used to
>      offline the device.
> 
>   5. Terminate the guest with an error message.
> 
>   6. ?
> 
>   Until the heartbeat is fully implemented and tested I suggest dropping
>   it from this patch series. Remember the G_IO_HUP will happen anyway if
>   the remote device process terminates.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]