

From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] migration: qemu-coroutine-lock.c:141: qemu_co_mutex_unlock: Assertion `mutex->locked == 1' failed
Date: Wed, 17 Sep 2014 16:04:23 +0100

On Wed, Sep 17, 2014 at 10:25 AM, Paolo Bonzini <address@hidden> wrote:
> On 17/09/2014 11:06, Stefan Hajnoczi wrote:
>> I think the fundamental problem here is that the mirror block job
>> on the source host does not synchronize with live migration.
>>
>> Remember, the mirror block job iterates over the dirty bitmap
>> whenever it feels like.
>>
>> There is no guarantee that the mirror block job has quiesced before
>> migration handover takes place, right?
>
> Libvirt does that.  Migration is started only once storage mirroring
> is out of the bulk phase, and the handover looks like:
>
> 1) migration completes
>
> 2) because the source VM is stopped, the disk has quiesced on the source

But the mirror block job might still be writing out dirty blocks.

> 3) libvirt sends block-job-complete

No, it sends block-job-cancel after the source QEMU's migration has
completed.  See the qemuMigrationCancelDriveMirror() call in
src/qemu/qemu_migration.c:qemuMigrationRun().
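
For reference, the cancel boils down to a plain block-job-cancel command
on the QMP monitor.  A minimal Python sketch of what a client would send
(the socket path and device name are illustrative only, not what libvirt
actually uses):

# Minimal sketch: what the cancel looks like on the QMP monitor.  The
# socket path and device name are made up for illustration.
import json
import socket

QMP_SOCK = "/tmp/example-qmp.sock"    # assumption: QEMU started with -qmp unix:...,server
DEVICE = "drive-virtio-disk0"         # assumption: the device the mirror job runs on

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect(QMP_SOCK)
qmp = sock.makefile("r")

json.loads(qmp.readline())            # QMP greeting banner
sock.sendall(json.dumps({"execute": "qmp_capabilities"}).encode() + b"\n")
json.loads(qmp.readline())            # {"return": {}}

# block-job-cancel only *requests* cancellation; it returns before the
# mirror job has actually stopped issuing I/O.
sock.sendall(json.dumps({"execute": "block-job-cancel",
                         "arguments": {"device": DEVICE}}).encode() + b"\n")
print(qmp.readline())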

> 4) libvirt receives BLOCK_JOB_COMPLETED.  The disk has now quiesced on
> the destination as well.

I don't see where this happens in the libvirt source code.  Libvirt
doesn't care about block job events for drive-mirror during migration.

And that's why there could still be I/O going on (since
block-job-cancel is asynchronous).
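
A client that wanted to guarantee quiescence would have to wait for the
job's terminating event (BLOCK_JOB_CANCELLED, or BLOCK_JOB_COMPLETED if
the job finished on its own) before letting the destination resume.
Roughly, as a hypothetical helper on top of the sketch above:

import json

def wait_for_block_job_end(qmp, device):
    """Hypothetical helper: block until QEMU emits the event that ends
    the block job on 'device'.  'qmp' is a file object over an
    already-negotiated QMP monitor connection, as in the sketch above."""
    while True:
        msg = json.loads(qmp.readline())
        if msg.get("event") in ("BLOCK_JOB_CANCELLED", "BLOCK_JOB_COMPLETED") \
                and msg.get("data", {}).get("device") == device:
            # Only at this point has the mirror job stopped issuing
            # writes, so the image is quiesced and the handover can
            # continue safely.
            return msg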

> 5) the VM is started on the destination
>
> 6) the NBD server is stopped on the destination and the source QEMU process is quit.
>
> It is actually a feature that storage migration is completed
> asynchronously with respect to RAM migration.  The problem is that
> qcow2_invalidate_cache happens between (3) and (5), and it doesn't
> like the concurrent I/O received by the NBD server.

I agree that qcow2_invalidate_cache() (and any other invalidate-cache
implementation) needs to allow concurrent I/O requests.

Either I'm misreading the libvirt code or libvirt is not actually
ensuring that the block job on the source has cancelled/completed
before the guest is resumed on the destination.  So I think there is
still a bug; maybe Eric can verify this?

Stefan


