From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions
Date: Mon, 14 Nov 2011 14:17:11 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Mon, Nov 14, 2011 at 11:58:14AM +0000, Daniel P. Berrange wrote:
> On Mon, Nov 14, 2011 at 01:56:36PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Nov 14, 2011 at 11:37:27AM +0000, Daniel P. Berrange wrote:
> > > On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Nov 14, 2011 at 11:29:18AM +0000, Daniel P. Berrange wrote:
> > > > > On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote:
> > > > > > Am 14.11.2011 12:08, schrieb Daniel P. Berrange:
> > > > > > > > On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote:
> > > > > > > >> On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:
> > > > > > >>> On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
> > > > > > >>>> On 11/11/2011 12:15 PM, Kevin Wolf wrote:
> > > > > > >>>>> Am 10.11.2011 22:30, schrieb Anthony Liguori:
> > > > > > >>>>>> Live migration with qcow2 or any other image format is just
> > > > > > >>>>>> not going to work
> > > > > > >>>>>> right now even with proper clustered storage. I think doing
> > > > > > >>>>>> a block level flush
> > > > > > >>>>>> cache interface and letting block devices decide how to do
> > > > > > >>>>>> it is the best approach.
> > > > > > >>>>>
> > > > > > >>>>> I would really prefer reusing the existing open/close code.
> > > > > > >>>>> It means
> > > > > > >>>>> less (duplicated) code, is existing code that is well tested
> > > > > > >>>>> and doesn't
> > > > > > >>>>> make migration much of a special case.
> > > > > > >>>>>
> > > > > > >>>>> If you want to avoid reopening the file on the OS level, we
> > > > > > >>>>> can reopen
> > > > > > >>>>> only the topmost layer (i.e. the format, but not the
> > > > > > >>>>> protocol) for now
> > > > > > >>>>> and in 1.1 we can use bdrv_reopen().
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>> Intuitively I dislike _reopen style interfaces. If the second
> > > > > > >>>> open
> > > > > > >>>> yields different results from the first, does it invalidate any
> > > > > > >>>> computations in between?
> > > > > > >>>>
> > > > > > >>>> What's wrong with just delaying the open?
> > > > > > >>>
> > > > > > >>> If you delay the 'open' until the mgmt app issues 'cont', then
> > > > > > > >>> you lose
> > > > > > >>> the ability to rollback to the source host upon open failure
> > > > > > >>> for most
> > > > > > >>> deployed versions of libvirt. We only fairly recently switched
> > > > > > >>> to a five
> > > > > > >>> stage migration handshake to cope with rollback when 'cont'
> > > > > > >>> fails.
> > > > > > >>>
> > > > > > >>> Daniel
> > > > > > >>
> > > > > > >> I guess reopen can fail as well, so this seems to me to be an
> > > > > > >> important
> > > > > > >> fix but not a blocker.
> > > > > > >
> > > > > > > > If the initial open succeeds, then it is far more likely that
> > > > > > > > a later
> > > > > > > > re-open will succeed too, because you have already eliminated the
> > > > > > > possibility
> > > > > > > of configuration mistakes, and will have caught most storage
> > > > > > > runtime errors
> > > > > > > too. So there is a very significant difference in reliability
> > > > > > > between doing
> > > > > > > an 'open at startup + reopen at cont' vs just 'open at cont'
> > > > > > >
> > > > > > > Based on the bug reports I see, we want to be very good at
> > > > > > > detecting and
> > > > > > > gracefully handling open errors because they are pretty frequent.
> > > > > >
> > > > > > Do you have some more details on the kind of errors? Missing files,
> > > > > > permissions, something like this? Or rather something related to the
> > > > > > actual content of an image file?
> > > > >
> > > > > Missing files due to wrong/missing NFS mounts, or incorrect SAN /
> > > > > iSCSI
> > > > > setup. Access permissions due to incorrect user / group setup, or read
> > > > > only mounts, or SELinux denials. Actual I/O errors are less common and
> > > > > are not so likely to cause QEMU to fail to start anyway, since QEMU is
> > > > > likely to just report them to the guest OS instead.
> > > >
> > > > Do you run qemu with -S, then give a 'cont' command to start it?
> > >
> > > Yes
> >
> > OK, so let's go back one step now - how is this related to
> > 'rollback to source host'?
>
> In the old libvirt migration protocol, by the time we run 'cont' on the
> destination, the source QEMU has already been killed off, so there's
> nothing to resume on failure.
>
> Daniel
I see. So again there are two solutions I see:
1. ignore old libvirt as it can't restart source reliably anyway
2. open files when migration is completed (after startup, but before cont)
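
To make the trade-off concrete, here is a toy model of where the guest ends up
depending on when the destination opens its image files. It is only a sketch,
not QEMU or libvirt code: the names OpenPoint and migrate are hypothetical, and
it assumes that an open failure detected before 'cont' is reported while the
source can still be resumed.

```python
from enum import Enum


class OpenPoint(Enum):
    """When the destination opens its image files (hypothetical names)."""
    AT_STARTUP = "startup"                  # open when the destination starts with -S
    AFTER_MIGRATION = "after_migration"     # option 2: after migration completes, before 'cont'
    AT_CONT = "cont"                        # delayed open: only when 'cont' is issued


def migrate(open_point, open_fails, source_killed_before_cont=True):
    """Return where the guest ends up: 'destination', 'source' or 'lost'.

    source_killed_before_cont=True models the old libvirt protocol described
    in the thread, where the source QEMU is gone by the time 'cont' runs.
    Assumption: a failure seen before 'cont' still allows resuming the source.
    """
    # Destination starts paused; an open failure here aborts the migration early.
    if open_point is OpenPoint.AT_STARTUP and open_fails:
        return "source"

    # Migration data has arrived; the source is still around at this point.
    if open_point is OpenPoint.AFTER_MIGRATION and open_fails:
        return "source"

    # Old protocol: the source is killed before 'cont' is run on the destination.
    source_alive = not source_killed_before_cont

    if open_point is OpenPoint.AT_CONT and open_fails:
        return "source" if source_alive else "lost"

    return "destination"


if __name__ == "__main__":
    for point in OpenPoint:
        print(point.value, "->", migrate(point, open_fails=True))
```

With the old protocol, only the open-at-'cont' case leaves nothing to resume;
opening after migration completes but before 'cont' keeps the failure on the
recoverable side, under the assumption stated above.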