qemu-devel

Re: [RFC PATCH 1/1] i386: Remove features from Epyc-Milan cpu


From: Daniel P . Berrangé
Subject: Re: [RFC PATCH 1/1] i386: Remove features from Epyc-Milan cpu
Date: Tue, 1 Feb 2022 09:18:16 +0000
User-agent: Mutt/2.1.5 (2021-12-30)

On Mon, Jan 31, 2022 at 05:18:04PM -0300, Leonardo Bras Soares Passos wrote:
> On Mon, Jan 31, 2022 at 3:04 PM Daniel P. Berrangé <berrange@redhat.com> 
> wrote:
> >
> > On Mon, Jan 31, 2022 at 02:56:38PM -0300, Leonardo Bras Soares Passos wrote:
> > > What I meant here is:
> > > 1 - A host with these feature bits starts a VM with EPYC-Milan cpu (and
> > > thus has those bits enabled)
> > > 2 - Guest is migrated to a host such as the above, which does not
> > > support those features (bits disabled), but does support EPYC-Milan
> > > cpus (without those features).
> > > 3 - The migration should be allowed, given the same cpu types. Then
> > > either we have:
> > > 3a : The guest vcpu stays with the flag enabled (case I tried to
> > > explain above), possibly crashing if the new feature is used, or
> > > 3b: The guest vcpu disables the flag due to incompatibility, which
> > > may confuse the guest due to the cpu change, and it may even end up
> > > trying to use the new feature, even though it's disabled.
> >
> > Neither should happen with a correctly written mgmt app in charge.
> >
> > When launching a QEMU process for an incoming migration, it is expected
> > that the mgmt app has first queried QEMU on the source to get the precise
> > CPU model + flags that were added/removed on the source. The QEMU on
> > the target is then launched with this exact set of flags, and the
> > 'check' flag is also set for -cpu. That will cause QEMU on the target
> > to refuse to start unless it can give the guest the 100% identical
> > CPUID to what has been requested on the CLI, and thus matching the
> > source.
> >
> > Libvirt will ensure all this is done correctly. If not using libvirt
> > then you've got a bunch of work to do to achieve this. It certainly
> > isn't sufficient to merely use the same plain '-cpu' arg that the
> > source was originally booted with, unless you have 100% identical
> > hardware, microcode, and software on both hosts, or the target host
> > offers a superset of features.
> 
> Oh, that is very interesting! Thanks for sharing!
> 
> Well, then at least one unexpected scenario should happen:
> - VM with EPYC-Milan cpu, created in source host
> - Source host with EPYC-Milan cpu. Support for 'extra features'
> enabled ( erms / fsrm in this ex.)
> - Target host with EPYC-Milan cpu. No support for 'extra features'.
> Since the VM will be created with support for 'extra features', trying
> to migrate from source host to target host should fail, right?
> 
> Which is, IMHO, odd. I imagine questions like:

Yes, it can certainly be surprising to users. It is a never ending
source of support requests from users. Note this isn't an AMD problem,
it affects Intel too, and indeed any scenario where features can be
hidden/visible based on firmware settings or microcode updates.

The classic is Intel removing the TSX related features in microcode
updates, which results in their CPUs losing the hle and rtm features.
This has caused migration compatibility pain for so many people.

> - "How can a host with EPYC-Milan cpu not offer support to
> receive a live migration of some VMs with EPYC-Milan cpu?", or even
> - "If I can create a VM with EPYC-Milan cpu on that host, why can't I
> receive (via migration) some VMs with EPYC-Milan CPU ?"

Yes, these are exactly the questions we get from users quite
frequently.

Ultimately we need to explain that there's more to CPU compatibility
than merely the physical hardware. Rather, it covers:

 - Physical CPU
 - Microcode update
 - Firmware settings
 - Host kernel version
 - QEMU version

Any one of those pieces can prevent a given feature being usable
by the guest, and so be the cause of live migration compatibility
trouble.

The number 1 priority is that mgmt apps don't allow the migration
to start if there is such an incompatibility, and we're pretty
good at that now.
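
As a rough sketch of what that pre-migration check boils down to (this is
not libvirt's or QEMU's actual code; the function names and the feature
sets are hypothetical and only for illustration), the mgmt app compares
the features the guest was given on the source against what the target
can provide, and refuses to start if anything is missing:

```python
def missing_on_target(source_features, target_features):
    """Return guest-visible features the target host cannot provide."""
    return set(source_features) - set(target_features)


def can_migrate(source_features, target_features):
    """Migration is only safe if the target offers a superset of features."""
    return not missing_on_target(source_features, target_features)


# Illustrative values: same named CPU model, but the target lacks the
# 'extra features' discussed in this thread.
source = {"avx2", "erms", "fsrm"}
target = {"avx2"}

if not can_migrate(source, target):
    print("refusing migration, target lacks:",
          sorted(missing_on_target(source, target)))
```

Migrating in the other direction would pass this check, since the
host with the extra features offers a superset.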

After that it becomes a documentation and training problem. It is
important to understand that if users have a cluster of machines that
they want to live migrate between, keeping those 5 pieces in sync
across all machines is very important. Microcode is usually the most
trouble, since it is the one that actively removes existing features
most frequently. We've had the kernel remove features proactively
though, to prevent VMs using them, in the expectation that a future
microcode update might later remove the same feature.
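
To make the "keep those 5 pieces in sync" advice concrete, here is a
minimal sketch of how an admin might flag drift across a cluster. The
HostProfile type, its field names, and the example values are all
made up for illustration; how the values are actually collected on each
host is left out entirely.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class HostProfile:
    # One field per piece listed above; values are opaque strings here.
    cpu_model: str
    microcode: str
    firmware_settings: str
    kernel: str
    qemu: str


def out_of_sync(hosts):
    """Map each piece to its distinct values, for pieces where hosts differ."""
    drift = {}
    for field in fields(HostProfile):
        values = {getattr(profile, field.name) for profile in hosts.values()}
        if len(values) > 1:
            drift[field.name] = values
    return drift


# Hypothetical cluster: identical apart from the microcode revision.
cluster = {
    "host-a": HostProfile("EPYC-Milan", "rev-1", "defaults", "5.16", "6.2"),
    "host-b": HostProfile("EPYC-Milan", "rev-2", "defaults", "5.16", "6.2"),
}

for piece, values in out_of_sync(cluster).items():
    print(f"{piece} differs across hosts: {sorted(values)}")
```

An empty result only says the inputs match; it is not a substitute for
the per-migration feature check done by the mgmt app.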

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



