qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v5 7/7] pc: add PC_I440FX_COMPAT to disable aerc


From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [PATCH v5 7/7] pc: add PC_I440FX_COMPAT to disable aercap for vifo device
Date: Wed, 18 Mar 2015 19:56:37 +0100

On Wed, Mar 18, 2015 at 12:08:15PM -0600, Alex Williamson wrote:
> On Wed, 2015-03-18 at 18:45 +0100, Michael S. Tsirkin wrote:
> > On Wed, Mar 18, 2015 at 11:11:28AM -0600, Alex Williamson wrote:
> > > On Wed, 2015-03-18 at 17:44 +0100, Michael S. Tsirkin wrote:
> > > > On Wed, Mar 18, 2015 at 09:45:29AM -0600, Alex Williamson wrote:
> > > > > On Wed, 2015-03-18 at 16:02 +0100, Michael S. Tsirkin wrote:
> > > > > > On Wed, Mar 18, 2015 at 08:50:54AM -0600, Alex Williamson wrote:
> > > > > > > On Wed, 2015-03-18 at 15:36 +0100, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Mar 18, 2015 at 08:15:01AM -0600, Alex Williamson wrote:
> > > > > > > > > On Wed, 2015-03-18 at 15:05 +0100, Michael S. Tsirkin wrote:
> > > > > > > > > > On Wed, Mar 18, 2015 at 08:02:26AM -0600, Alex Williamson 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Wed, 2015-03-18 at 14:23 +0100, Michael S. Tsirkin 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > typo in subject: vfio, not vifo.
> > > > > > > > > > > > 
> > > > > > > > > > > > On Thu, Mar 12, 2015 at 06:23:59PM +0800, Chen Fan 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > for piix4 chipset, we don't need to expose aer, so 
> > > > > > > > > > > > > introduce
> > > > > > > > > > > > > PC_I440FX_COMPAT for all piix4 machines to disable 
> > > > > > > > > > > > > aercap,
> > > > > > > > > > > > > and add HW_COMPAT_2_2 to disable aercap for all lower
> > > > > > > > > > > > > than 2.3.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Signed-off-by: Chen Fan <address@hidden>
> > > > > > > > > > > > 
> > > > > > > > > > > > Well vfio is never migrated ATM.
> > > > > > > > > > > > So why is compat code needed at all?
> > > > > > > > > > > 
> > > > > > > > > > > It's not for migration, it's to maintain current behavior 
> > > > > > > > > > > on existing
> > > > > > > > > > > platforms.  If someone gets an uncorrected AER error on 
> > > > > > > > > > > q35 machine type
> > > > > > > > > > > today, the VM stops.  With this change, AER would be 
> > > > > > > > > > > exposed to the
> > > > > > > > > > > guest and the guest could handle it.  The compat change 
> > > > > > > > > > > therefore
> > > > > > > > > > > maintains the stop VM behavior on existing q35 machine 
> > > > > > > > > > > types.
> > > > > > > > > > 
> > > > > > > > > > If stop VM behaviour is useful, expose it to users.
> > > > > > > > > > If not, then don't.
> > > > > > > > > > I don't see why does it have to be tied to machine types.
> > > > > > > > > 
> > > > > > > > > Because q35-2.2 machine type will currently do a stop VM on 
> > > > > > > > > uncorrected
> > > > > > > > > AER error.  If we don't tie that to a machine option then 
> > > > > > > > > q35-2.2 would
> > > > > > > > > suddenly start exposing the error to the guest.  That's a 
> > > > > > > > > fairly
> > > > > > > > > significant change in behavior for a static machine type.
> > > > > > > > 
> > > > > > > > I don't think you can classify it as a behaviour change. VM 
> > > > > > > > stop is not
> > > > > > > > guest visible behaviour.
> > > > > > > 
> > > > > > > In one case, an uncorrected AER occurs and the VM is stopped by 
> > > > > > > QEMU.
> > > > > > > In the other case, the guest is notified and may attempt 
> > > > > > > corrective
> > > > > > > action... or maybe the guest doesn't understand AER and the user 
> > > > > > > is
> > > > > > > depending on the previous behavior.  That is absolutely a behavior
> > > > > > > change.
> > > > > > > 
> > > > > > > > Are you worrying about guests misbehaving when they see these 
> > > > > > > > errors?
> > > > > > > > Then you want this as user-controlled, supported option.
> > > > > > > 
> > > > > > > Whether the option is user visible is tangential to whether the 
> > > > > > > behavior
> > > > > > > of existing machine types should be maintained.  Existing machine 
> > > > > > > types
> > > > > > > can impose a different default than current machine types.
> > > > > > >
> > > > > > > > In other words: we only tie things to machine types when we
> > > > > > > > have to. This code gets almost no testing, and is a lot of
> > > > > > > > work to test. This one sounds like "just in case" is not a good
> > > > > > > > motivation.
> > > > > > > 
> > > > > > > It seems like an obvious use case for using machine types to 
> > > > > > > maintain
> > > > > > > compatibility with previous behavior, which is exactly why we have
> > > > > > > machine types.  If we're not going to use it, why do we have it?
> > > > > > 
> > > > > > We have machine types because of the following issues:
> > > > > > - some silent changes confuse guests. For example guest installed 
> > > > > > with
> > > > > >   one machine type might not boot if you try to use it after
> > > > > >   changing something, or - in case of windows - throw up warnings.
> > > > > > - some changes break migration
> > > > > > 
> > > > > > Looks like none of these cases.
> > > > > > If AER is unsafe, turn it off by default for everyone.
> > > > > 
> > > > > This is silly, we have the tools, let's use them.
> > > > 
> > > > It's a very expensive tool, maintainance-wise. We often don't
> > > > have the choice but I'm not going to use this tool by choice
> > > > unless we know why we are doing this.
> > > > 
> > > > > If a user is running
> > > > > a VM that gets a VM stop on AER error one day and they upgrade QEMU 
> > > > > and
> > > > > restart it, they should get the same behavior, whether a migration is
> > > > > involved or not.
> > > > 
> > > > You keep saying this, but why should it? Answer that question, the rest
> > > > will follow.
> > > 
> > > My answer is that it's a user visible change in behavior that they may
> > > rely on and would not expect to change within an existing machine type.
> > > Silently modifying the default may expose them to error conditions that
> > > were previously handled by QEMU and adds AER handling requirements to
> > > the guest.
> > > 
> > > Ignoring the question about whether an AER can be reliably recovered in
> > > the guest for a moment, a typical PC hardware platform will signal the
> > > running OS with AER errors, so it seems that *if* we can achieve parity
> > > in a virtual machine with that signaling, and more importantly the
> > > recovery process, then the default going forward should be to mimic bare
> > > metal behavior, thus changing the default from what we do today.
> > > 
> > > As it is now, I think we have too many outstanding questions about how
> > > recover occurs to change the default.
> > > 
> > > So let me return the question, if we were to resolve those questions and
> > > change the default handling from what we do today, creating a user
> > > visible behavior change and imposing new requirements for AER handling
> > > in the guest, why is that not worthy of machine type stability?  I'd
> > > certainly expect it from a distro specific machine type.
> > 
> > This really depends on what guests do with this.
> > We try to avoid guest visible changes during migration
> > (even that's not 100%).
> > There aren't such requirements for devices where migration
> > is disabled, e.g. we did many guest visible changes in q35.
> > 
> > Latest machine types are better tested. Deviating from latest machine
> > just for a theoretical "this is guest visible" isn't going to give
> > you better stability, it will give you worse stability.
> 
> You keep calling this theoretical.  As it exists today, an uncorrected
> AER error results in a VM stop with no guest intervention or support
> required.  If we were to change the default, the guest would instead be
> notified of the AER and given the opportunity to recover.  That's not
> just guest visible, that's a complete change in error handling paradigm.

OK, so maybe it's a feature that users should have control over.
But tying it to machine types makes no sense.

> > If you are worried about guest bugs, they are not going
> > away just because we release new QEMU.So if guests are
> > buggy with this feature, this needs a solution for all
> > machine types.
> 
> Guest bugs are not my issue.
> 
> > Anyway, if we keep the default, that's even easier.
> 
> Agreed, and that may be where we end if we can't come up with a
> reasonable expectation that the guest can recover from an AER.  However,
> I think machine compatibility should be fair game if we do want to flip
> that switch.
> 
> > > > > Maybe the default should be disabled, this patch
> > > > > series hasn't yet even convinced me that there's a worthwhile general
> > > > > case where the guest can recover, but using the existing machine
> > > > > compatibility infrastructure should be at our disposal if we do think
> > > > > the default going forward should be different than the behavior today.
> > > > 
> > > > I'm sorry, I don't think it's the right tool for the job.
> > > > 
> > > 
> > > 
> 
> 



reply via email to

[Prev in Thread] Current Thread [Next in Thread]