qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v5 7/7] pc: add PC_I440FX_COMPAT to disable aerc


From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH v5 7/7] pc: add PC_I440FX_COMPAT to disable aercap for vifo device
Date: Wed, 18 Mar 2015 12:08:15 -0600

On Wed, 2015-03-18 at 18:45 +0100, Michael S. Tsirkin wrote:
> On Wed, Mar 18, 2015 at 11:11:28AM -0600, Alex Williamson wrote:
> > On Wed, 2015-03-18 at 17:44 +0100, Michael S. Tsirkin wrote:
> > > On Wed, Mar 18, 2015 at 09:45:29AM -0600, Alex Williamson wrote:
> > > > On Wed, 2015-03-18 at 16:02 +0100, Michael S. Tsirkin wrote:
> > > > > On Wed, Mar 18, 2015 at 08:50:54AM -0600, Alex Williamson wrote:
> > > > > > On Wed, 2015-03-18 at 15:36 +0100, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Mar 18, 2015 at 08:15:01AM -0600, Alex Williamson wrote:
> > > > > > > > On Wed, 2015-03-18 at 15:05 +0100, Michael S. Tsirkin wrote:
> > > > > > > > > On Wed, Mar 18, 2015 at 08:02:26AM -0600, Alex Williamson 
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, 2015-03-18 at 14:23 +0100, Michael S. Tsirkin wrote:
> > > > > > > > > > > typo in subject: vfio, not vifo.
> > > > > > > > > > > 
> > > > > > > > > > > On Thu, Mar 12, 2015 at 06:23:59PM +0800, Chen Fan wrote:
> > > > > > > > > > > > for piix4 chipset, we don't need to expose aer, so 
> > > > > > > > > > > > introduce
> > > > > > > > > > > > PC_I440FX_COMPAT for all piix4 machines to disable 
> > > > > > > > > > > > aercap,
> > > > > > > > > > > > and add HW_COMPAT_2_2 to disable aercap for all lower
> > > > > > > > > > > > than 2.3.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Chen Fan <address@hidden>
> > > > > > > > > > > 
> > > > > > > > > > > Well vfio is never migrated ATM.
> > > > > > > > > > > So why is compat code needed at all?
> > > > > > > > > > 
> > > > > > > > > > It's not for migration, it's to maintain current behavior 
> > > > > > > > > > on existing
> > > > > > > > > > platforms.  If someone gets an uncorrected AER error on q35 
> > > > > > > > > > machine type
> > > > > > > > > > today, the VM stops.  With this change, AER would be 
> > > > > > > > > > exposed to the
> > > > > > > > > > guest and the guest could handle it.  The compat change 
> > > > > > > > > > therefore
> > > > > > > > > > maintains the stop VM behavior on existing q35 machine 
> > > > > > > > > > types.
> > > > > > > > > 
> > > > > > > > > If stop VM behaviour is useful, expose it to users.
> > > > > > > > > If not, then don't.
> > > > > > > > > I don't see why does it have to be tied to machine types.
> > > > > > > > 
> > > > > > > > Because q35-2.2 machine type will currently do a stop VM on 
> > > > > > > > uncorrected
> > > > > > > > AER error.  If we don't tie that to a machine option then 
> > > > > > > > q35-2.2 would
> > > > > > > > suddenly start exposing the error to the guest.  That's a fairly
> > > > > > > > significant change in behavior for a static machine type.
> > > > > > > 
> > > > > > > I don't think you can classify it as a behaviour change. VM stop 
> > > > > > > is not
> > > > > > > guest visible behaviour.
> > > > > > 
> > > > > > In one case, an uncorrected AER occurs and the VM is stopped by 
> > > > > > QEMU.
> > > > > > In the other case, the guest is notified and may attempt corrective
> > > > > > action... or maybe the guest doesn't understand AER and the user is
> > > > > > depending on the previous behavior.  That is absolutely a behavior
> > > > > > change.
> > > > > > 
> > > > > > > Are you worrying about guests misbehaving when they see these 
> > > > > > > errors?
> > > > > > > Then you want this as user-controlled, supported option.
> > > > > > 
> > > > > > Whether the option is user visible is tangential to whether the 
> > > > > > behavior
> > > > > > of existing machine types should be maintained.  Existing machine 
> > > > > > types
> > > > > > can impose a different default than current machine types.
> > > > > >
> > > > > > > In other words: we only tie things to machine types when we
> > > > > > > have to. This code gets almost no testing, and is a lot of
> > > > > > > work to test. This one sounds like "just in case" is not a good
> > > > > > > motivation.
> > > > > > 
> > > > > > It seems like an obvious use case for using machine types to 
> > > > > > maintain
> > > > > > compatibility with previous behavior, which is exactly why we have
> > > > > > machine types.  If we're not going to use it, why do we have it?
> > > > > 
> > > > > We have machine types because of the following issues:
> > > > > - some silent changes confuse guests. For example guest installed with
> > > > >   one machine type might not boot if you try to use it after
> > > > >   changing something, or - in case of windows - throw up warnings.
> > > > > - some changes break migration
> > > > > 
> > > > > Looks like none of these cases.
> > > > > If AER is unsafe, turn it off by default for everyone.
> > > > 
> > > > This is silly, we have the tools, let's use them.
> > > 
> > > It's a very expensive tool, maintainance-wise. We often don't
> > > have the choice but I'm not going to use this tool by choice
> > > unless we know why we are doing this.
> > > 
> > > > If a user is running
> > > > a VM that gets a VM stop on AER error one day and they upgrade QEMU and
> > > > restart it, they should get the same behavior, whether a migration is
> > > > involved or not.
> > > 
> > > You keep saying this, but why should it? Answer that question, the rest
> > > will follow.
> > 
> > My answer is that it's a user visible change in behavior that they may
> > rely on and would not expect to change within an existing machine type.
> > Silently modifying the default may expose them to error conditions that
> > were previously handled by QEMU and adds AER handling requirements to
> > the guest.
> > 
> > Ignoring the question about whether an AER can be reliably recovered in
> > the guest for a moment, a typical PC hardware platform will signal the
> > running OS with AER errors, so it seems that *if* we can achieve parity
> > in a virtual machine with that signaling, and more importantly the
> > recovery process, then the default going forward should be to mimic bare
> > metal behavior, thus changing the default from what we do today.
> > 
> > As it is now, I think we have too many outstanding questions about how
> > recover occurs to change the default.
> > 
> > So let me return the question, if we were to resolve those questions and
> > change the default handling from what we do today, creating a user
> > visible behavior change and imposing new requirements for AER handling
> > in the guest, why is that not worthy of machine type stability?  I'd
> > certainly expect it from a distro specific machine type.
> 
> This really depends on what guests do with this.
> We try to avoid guest visible changes during migration
> (even that's not 100%).
> There aren't such requirements for devices where migration
> is disabled, e.g. we did many guest visible changes in q35.
> 
> Latest machine types are better tested. Deviating from latest machine
> just for a theoretical "this is guest visible" isn't going to give
> you better stability, it will give you worse stability.

You keep calling this theoretical.  As it exists today, an uncorrected
AER error results in a VM stop with no guest intervention or support
required.  If we were to change the default, the guest would instead be
notified of the AER and given the opportunity to recover.  That's not
just guest visible, that's a complete change in error handling paradigm.

> If you are worried about guest bugs, they are not going
> away just because we release new QEMU.So if guests are
> buggy with this feature, this needs a solution for all
> machine types.

Guest bugs are not my issue.

> Anyway, if we keep the default, that's even easier.

Agreed, and that may be where we end if we can't come up with a
reasonable expectation that the guest can recover from an AER.  However,
I think machine compatibility should be fair game if we do want to flip
that switch.

> > > > Maybe the default should be disabled, this patch
> > > > series hasn't yet even convinced me that there's a worthwhile general
> > > > case where the guest can recover, but using the existing machine
> > > > compatibility infrastructure should be at our disposal if we do think
> > > > the default going forward should be different than the behavior today.
> > > 
> > > I'm sorry, I don't think it's the right tool for the job.
> > > 
> > 
> > 






reply via email to

[Prev in Thread] Current Thread [Next in Thread]