qemu-devel
From: Cornelia Huck
Subject: Re: [Qemu-devel] [PATCH 00/18] virtio-blk: Support "VIRTIO_CONFIG_S_NEEDS_RESET"
Date: Tue, 21 Apr 2015 10:04:47 +0200

On Tue, 21 Apr 2015 15:44:02 +0800
Fam Zheng <address@hidden> wrote:

> On Mon, 04/20 17:13, Cornelia Huck wrote:
> > On Fri, 17 Apr 2015 15:59:15 +0800
> > Fam Zheng <address@hidden> wrote:
> > 
> > > Currently, virtio code chooses to kill QEMU if the guest passes any invalid
> > > data with vring. That has drawbacks such as losing unsaved data (e.g. when
> > > guest user is writing a very long email), or possible denial of service in
> > > a nested vm use case where virtio device is passed through.
> > > 
> > > virtio-1 has introduced a new status bit "NEEDS RESET" which could be used
> > > to improve this by communicating the error state between virtio devices
> > > and drivers. The device notifies guest upon setting the bit, then the
> > > guest driver should detect this bit and report to userspace, or recover
> > > the device by resetting it.
> > > 
> > > This series makes necessary changes in virtio core code, based on which
> > > virtio-blk is converted. Other devices now keep the existing behavior by
> > > passing in "error_abort". They will be converted in following series. The
> > > Linux driver part will also be worked on.
> > > 
> > > One concern with this behavior change is that it's now harder to notice
> > > the actual driver bug that caused the error, as the guest continues to
> > > run. To address that, we could probably add a new error action option to
> > > virtio devices, similar to the "read/write werror" in block layer, so the
> > > vm could be paused and the management will get an event in QMP like
> > > pvpanic. This work can be done on top.
> > 
> > In principle, this looks nice; I'm not sure however how this affects
> > non-virtio-1 devices.
> > 
> > If a device is operating in virtio-1 mode, everything is clearly
> > specified: The guest is notified and if it is aware of the NEEDS_RESET
> > bit, it can react accordingly.
> > 
> > But what about legacy devices? Even if they are notified, they don't
> > know to check for NEEDS_RESET - and I'm not sure if the undefined
> > behaviour after NEEDS_RESET might lead to bigger trouble than killing
> > off the guest.
> > 
> 
> The device should become unresponsive to VQ output until guest issues a reset
> through bus commands.  Do you have an example of "big trouble" in mind?

I'm not sure what's supposed to happen if NEEDS_RESET is set but not
everything is fenced off. The guest may see that queues have become
unresponsive, but if we don't stop ioeventfds and fence off
notifications, it may easily get into an undefined state internally.
And if it is connected to other guests via networking, having it limp
on may be worse than just killing it off. (Which parts of the data have
been cleanly written to disk and which haven't? How is it going to get
out of that pickle if it has no good idea of what is wrong?)

If I have to debug a non-working guest, I prefer a crashed one with a
clean state over one that has continued running after the error
occurred. For legacy, I vote for killing the guest off as before; a
virtio-1 aware guest can choose to either recover the device via reset,
if possible, or die of its own accord if the problem is non-recoverable.
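
For readers following along, the behavior Fam describes (mark the device broken and fence off virtqueue processing instead of aborting QEMU, until the guest resets) can be sketched roughly as below. This is an illustrative model only, with hypothetical struct and function names; it is not QEMU's actual virtio core, and real code would also raise a configuration interrupt and stop ioeventfds when marking the device broken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device status bits from the virtio 1.0 spec. */
#define VIRTIO_CONFIG_S_DRIVER_OK   0x04
#define VIRTIO_CONFIG_S_NEEDS_RESET 0x40

/* Hypothetical minimal device model, not QEMU's real VirtIODevice. */
typedef struct {
    uint8_t status;   /* guest-visible device status field */
    bool broken;      /* device refuses further VQ processing */
} VirtIODev;

/* On invalid guest data: instead of abort(), mark the device broken
 * and set NEEDS_RESET (real code would then notify the guest via a
 * config interrupt). */
static void virtio_mark_broken(VirtIODev *vdev)
{
    vdev->broken = true;
    vdev->status |= VIRTIO_CONFIG_S_NEEDS_RESET;
}

/* Virtqueue handlers consult this and drop output from a broken
 * device, so the guest sees unresponsive queues rather than a crash. */
static bool virtio_vq_can_process(const VirtIODev *vdev)
{
    return !vdev->broken;
}

/* A guest-initiated reset through bus commands clears the error state
 * and returns the device to a clean, fully specified initial state. */
static void virtio_do_reset(VirtIODev *vdev)
{
    vdev->status = 0;
    vdev->broken = false;
}
```

As the thread notes, this only has well-defined semantics for virtio-1 guests that know to check NEEDS_RESET; a legacy guest would simply observe queues that stopped responding.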



