qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [Qemu-devel] [PATCH v2 3/8] fdc: Introduce fdctrl->phas


From: Markus Armbruster
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH v2 3/8] fdc: Introduce fdctrl->phase
Date: Mon, 01 Jun 2015 14:46:47 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)

"Dr. David Alan Gilbert" <address@hidden> writes:

> * Markus Armbruster (address@hidden) wrote:
>> "Dr. David Alan Gilbert" <address@hidden> writes:
>> 
>> > * John Snow (address@hidden) wrote:
>> >> 
>> >> 
>> >> On 05/21/2015 09:19 AM, Kevin Wolf wrote:
>> >> > The floppy controller spec describes three different controller phases,
>> >> > which are currently not explicitly modelled in our emulation. Instead,
>> >> > each phase is represented by a combination of flags in registers.
>> >> > 
>> >> > This patch makes explicit in which phase the controller currently is.
>> >> > 
>> >> > Signed-off-by: Kevin Wolf <address@hidden>
>
> <snip>
>
>> >> > +static bool fdc_phase_needed(void *opaque)
>> >> > +{
>> >> > +    FDCtrl *fdctrl = opaque;
>> >> > +
>> >> > +    return reconstruct_phase(fdctrl) != fdctrl->phase;
>> >> > +}
>> >
>> > OK, that is one way of doing it which should work, as long
>> > as most of the time that matches.
>> >
>> > My preference for subsections is normally to make them just
>> > conditional on machine type, that way backwards migration just
>> > works.  However, if the result of the backwards migration would
>> > be data corruption (which if I understand what you saying it could
>> > in this case) then it's arguably better to fail migration in those
>> > cases.
>> 
>> This isn't a matter of preference.
>> 
>> Subsections were designed to solves a problem we only have because we're
>> not perfect: device model bugs, namely forgetting to migrate a bit of
>> state, or failing to model it in the first place.
>> 
>> Subsections tied to machine types solve an entirely different problem:
>> the old version of the device is just fine, including migration, but we
>> now want a new and improved variant, which needs some more state
>> migrated.  Putting that piece of state into a subsection tied to the new
>> machine type avoids code duplication.
>> 
>> But what do you do for device model bugs serious enough to need fixing
>> even for existing machine types, when the fix needs additional state
>> migrated?  Subsections tied to machine types are *not* a solution there.
>> That's why this isn't about preference!  It's about having a bug to fix
>> or not.
>
> My problem is that people keep adding subsections to fix minor device
> bugs because they think breaking migration is irrelevant.  If the bug
> being fixed causes data corruption then OK I agree it is probably better
> to break migration, otherwise  you need to make a decision about whether
> fixing the device bug during migration is better or worse than having
> the migration fail.
> Some people say 'Migration failure is ok - people just try it again';
> sorry, no, that's useless in environments where the VM has 100s of GB
> of RAM and takes hours to migrate only for it then to fail, and it
> was urgent that it was migrated off that source host.
>
>> You seem rather reluctant to use subsections for their intended purpose.
>> I don't understand; their correctness argument is really simple.
>
> Correct, I am; it encourages people to break migration compatibility
> too strongly.
>
> Now, since I have to worry about backwards-migration downstream
> this influences my choice.   People keep telling me that upstream
> doesn't care about backwards migration compatibility, but I've not
> seen any evidence of that,

If we cared about backwards migration compatibility, we'd test it.

In my opinion, we do care about infrastructure for making backwards
migration possible, but that's about it.

>                            indeed I've seen other people trying
> to put backwards migration fixes in.

Doesn't quite rise to a level I'd call "care".

>> Fundamentally, we're picking a pair of state serialization and
>> deserialization functions (in the mathematical sense).
>> 
>> The obvious encoding function for a state consisting of pieces is to
>> encode all the pieces one by one.  Call this e(), and its decoding
>> function d().
>> 
>> Here's an alternative pair e'() and d'():
>> 
>> * Pick a piece of the state.
>> 
>> * Pick one of its values.  Let's call it "the prevalent value" for
>>   reasons that will become apparent shortly.
>> 
>> * Encode just like e() does, except omit this piece of state if and only
>>   if it has its prevalent value.
>> 
>> * In the decoding function, default this piece of state to its prevalent
>>   value.
>> 
>> The encodings are equivalent: d(e(x)) = d'(e'(x)) for all x.
>> 
>> At this point, you might want to tell me not to waste your time with
>> trivial math.  If so, good!  The triviality is exactly my point.
>
> Well I was going to mention it, but since you already have I wont bother....
> Actually, the problem here is that the symbols look trivial but they're
> meaningless without the following textual description, and that's
> where the problems are.
>
>> Straightforwardly translated to code, e() and d() put everything in the
>> section.  Drawback: add/remove/modify of state kills migration to and
>> from older versions.
>> 
>> e'() and d'() put a piece of state in a subsection that is only present
>> when the piece doesn't have its prevalent value.
>> 
>> Because the encodings are equivalent, forward migration transfers
>> exactly the same state regardless of which of them we choose.
>> 
>> So why not choose one that makes backward migration work exactly when
>> it's safe?
>> 
>> * Put all pieces of state the old version doesn't understand in
>>   subsections.
>> 
>> * Identify prevalent state.
>> 
>> * Double-check the old version behaves safely in prevalent state.
>> 
>> * In prevalent state, no subsections are sent, and migration succeeds.
>> 
>> * In any other state, migration fails.
>> 
>> Naturally, this is useful only when the prevalent state is actually
>> prevalent.
>
> The 'prevalent state' idea is what I hate about the use of subsections.
>
> It's very very difficult to reason about:
>    1) subtle changes in either the guest OS or the implementation can
>       change what the 'prevalent state' is.
>    2) Are you sure that the prevalent state is the same on Windows n+1?
>    3) What about during boot?
>    4) What about during installation?
>
> It's then roulette as to whether a migration will work.
> It's also a nightmare to test, having bugs where migration fails 1/100
> times on some random OS in a production environment is horrible, and
> that's exactly what the 'prevalent state' design produces.

Thanks for explaining your thinking, I believe I understand it better
now.

The rationale behind the "prevalent state" design is that failing
migration outright is better than letting it silently corrupt state.  I
still think that makes some sense.  However:

> I work to the rule that:
>    a) If the migration bug is such that the effect of the missing state
>       is really bad; a hung guest, or data corruption - then sure
>       use a subsection based on a 'prevalent state' decision.
>
>    b) Else tie it to the machine type.

You're right in that state corruption isn't necessarily catastrophic.
When its impact is minor, availability (backward migration succeeds)
*can* be preferable to correctness (state is exactly preserved under
backward and forward migration).

In other words: I agree with you there's a tradeoff to be made.

>    c) A combination is even better; i.e. always send the state
>       on new machine types, just because it makes the behaviour
>       more consistent.

I don't like (c), because it adds a second "needed" predicate.  Granted,
the new one's as trivial as can be, but X + trivial is still more
complicated than just X.

[...]



reply via email to

[Prev in Thread] Current Thread [Next in Thread]