Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?

From: David Gibson
Subject: Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?
Date: Mon, 19 Jul 2021 12:31:06 +1000

On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
> Commit 2500fb423a "migration: Include migration support for machine
> check handling" adds this:
>     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);
>     if (ret == -EBUSY) {
>         /*
>          * We don't want to abort so we let the migration to continue.
>          * In a rare case, the machine check handler will run on the target.
>          * Though this is not preferable, it is better than aborting
>          * the migration or killing the VM.
>          */
>         warn_report("Received a fwnmi while migration was in progress");
>     }
> migrate_add_blocker() can fail in two ways:
> 1. -EBUSY: migration is already in progress
>    Ignoring this one is clearly intentional.  The comment explains why.
>    I'm taking it at face value (I'm a spapr ignoramus).

Right.  The argument isn't really about papr particularly, except
insofar as understanding what fwnmi is.  fwnmi (FirmWare assisted NMI)
is a reporting mechanism for certain low-level hardware failures
(think memory ECC or cpu level faults, IIRC).  If we migrate between
detecting and reporting the error, then the particulars we report will
be mostly meaningless since they relate to hardware we're no longer
running on.  Hence the migration blocker.

However, migrating away from a (non-fatal) fwnmi error is a pretty
reasonable response, so we don't want to actually fail a migration if
it's already in progress.

>    Aside: I doubt
>    the warning is going to help users.

You're probably right, but it's not very clear how to do better.  It
might possibly help someone in tech support explain why the reported
fwnmi doesn't seem to match the hardware the guest is (now) running
on.

> 2. -EACCES: we're running with -only-migratable
>    Why may we ignore -only-migratable here?

Short answer: because I didn't think about that case.  Long answer:
I think we probably should ignore it anyway.  As above, receiving a
fwnmi doesn't really prevent migration, it just means that if you're
unlucky it can report stale information.  Since migrating away from a
possibly-dubious host would be a reasonable response to a non-fatal
fwnmi, I don't think we want to simply prohibit fwnmi entirely with

> By the way, we leak @local_err on failure.  I'll post a patch, but I'd
> like my question answered first.
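
For illustration only, here is a minimal standalone sketch of the
behaviour argued for above: treat every migrate_add_blocker() failure
as non-fatal, warn, and free the error object so it doesn't leak.  It
is not the actual patch.  The Error struct, the stub_* helpers, and
fwnmi_block_migration() are hypothetical stand-ins so the snippet
compiles outside the QEMU tree; the warning text is also made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for QEMU's Error object; NOT the real definition. */
typedef struct Error {
    const char *msg;
} Error;

/* Counts freed errors so the no-leak behaviour can be checked. */
static int freed_errors;

/*
 * Stub for migrate_add_blocker(): fails with the given negative errno
 * (e.g. -EBUSY or -EACCES) and sets *errp, or succeeds when passed 0.
 */
static int stub_migrate_add_blocker(int fail_with, Error **errp)
{
    assert(errp != NULL);
    if (fail_with < 0) {
        Error *err = malloc(sizeof(*err));
        if (!err) {
            abort();
        }
        err->msg = (fail_with == -EBUSY) ? "migration in progress"
                                         : "only-migratable is set";
        *errp = err;
    }
    return fail_with;
}

/* Stub for error_free(). */
static void stub_error_free(Error *err)
{
    if (err) {
        free(err);
        freed_errors++;
    }
}

/*
 * Hypothetical caller: neither failure mode stops fwnmi delivery.
 * We warn and always release local_err.
 */
static int fwnmi_block_migration(int fail_with)
{
    Error *local_err = NULL;
    int ret = stub_migrate_add_blocker(fail_with, &local_err);

    if (ret < 0) {
        fprintf(stderr, "warning: fwnmi migration blocker not added: %s\n",
                local_err->msg);
        stub_error_free(local_err);
    }
    return ret;
}
```

The caller still sees the return value, so it could tell -EBUSY from
-EACCES if that ever turns out to matter, but in both cases the error
is reported and freed rather than dropped on the floor.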

