qemu-block

Re: Open qcow2 on multiple hosts simultaneously.


From: Kevin Wolf
Subject: Re: Open qcow2 on multiple hosts simultaneously.
Date: Tue, 21 Nov 2023 11:46:24 +0100

On 20.11.2023 at 13:41, kvaps wrote:
> Hey Alberto,
> 
> My article on this design has just been published.
> In this article I talk about the chosen technologies and the
> ReadWriteMany implementation:
> 
> https://blog.deckhouse.io/lvm-qcow-csi-driver-shared-san-kubernetes-81455201590e

That sounds like a nice proof of concept! Could you reuse some of the
code from oVirt or did you end up reimplementing something similar? Do
you have a link to a git repo that you could share?

The implementation of RWX seems dangerous to me, though. Of course,
there is the whole point that it's not actually RWX and therefore it's
easy to misuse. But even if you do use the exported block node only with
QEMU and access it from two nodes only with coordination by the live
migration protocol, it seems unclear to me how this guarantees
consistency.

The kernel page cache is one part of the problem and as you say, running
with cache.direct=on solves it. However, the qcow2 driver also keeps
metadata caches internally, and only inactivating the image on the
source host gives you the guarantee that it's safe to access on the
destination host. How did you solve this part?
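
(For clarity, what I mean by running with cache.direct=on is roughly an
invocation like the following; the node names and the LV path are only
placeholders:

    qemu-storage-daemon \
        --blockdev driver=host_device,node-name=lun0,filename=/dev/vg0/lv0,cache.direct=on \
        --blockdev driver=qcow2,node-name=fmt0,file=lun0 \
        ...

Setting cache.direct=on on the protocol node bypasses the kernel page
cache, but the qcow2 metadata caches I mentioned are still there.)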

> To anticipate the question, I would like to mention that I have tested
> the method of exporting the volume from a single QSD instance to another
> host over the user network using NBD, and I ran into significant
> performance issues. Additionally, I would like to note that this method
> is overkill when you already have the data accessible on a backing
> block device via SAN.

Yes, going over the network with NBD is obviously expected to
come with some performance penalty. But it's the thing that would make
it an actual RWX implementation and give you all the guarantees that
you need to make sure that you don't corrupt the qcow2 metadata.
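
To illustrate, such an NBD-based setup could look roughly like this
(node names, export names, addresses and paths are only placeholders):

    # on the node that currently owns the image
    qemu-storage-daemon \
        --blockdev driver=host_device,node-name=lun0,filename=/dev/vg0/lv0,cache.direct=on \
        --blockdev driver=qcow2,node-name=fmt0,file=lun0 \
        --nbd-server addr.type=inet,addr.host=0.0.0.0,addr.port=10809 \
        --export type=nbd,id=exp0,name=exp0,node-name=fmt0,writable=on

    # all other nodes access the volume only through NBD
    qemu-storage-daemon \
        --blockdev driver=nbd,node-name=remote0,server.type=inet,server.host=node1,server.port=10809,export=exp0 \
        ...

Everything that needs the volume on the other nodes would then sit on
top of remote0 instead of opening the qcow2 image directly.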

For the QEMU case (where you don't really do RWX), I think we could
migrate the QSD instance to the destination node very soon after
detecting that the I/O is now coming from elsewhere, so the performance
hit would be for a very short time. Maybe QEMU could even give hints to
QSD like making the connection read-only when inactivating the image on
the source, so that the switch could be made immediately based on this.

This will require some changes to QSD to actually enable it to migrate
somewhere else, but this is something we are already planning to do.

> We have opted for an approach that switches cache.direct during live
> migration of a virtual machine, on the assumption that it is not a
> full-fledged ReadWriteMany and will be used solely for the live
> migration of virtual machines.
> 
> Best Regards,
> Andrei Kvapil
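
For reference, I assume that switching cache.direct at runtime means a
blockdev-reopen of the protocol node, roughly like this (node name and
path are made up):

    { "execute": "blockdev-reopen",
      "arguments": {
          "options": [
              { "driver": "host_device",
                "node-name": "lun0",
                "filename": "/dev/vg0/lv0",
                "cache": { "direct": true } }
          ] } }

But as I said above, that only takes care of the kernel page cache, not
of the qcow2 metadata caches.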

Would you be interested in cooperating on integrating your code and
Subprovisioner into something that could eventually expose all of the
functionality that QEMU and qcow2 enable? I think this would result in
something much more powerful than if the effort is split across multiple
smaller projects that each cover only a specific special case.

Kevin

> On Wed, Aug 16, 2023 at 11:31 AM Alberto Faria <afaria@redhat.com> wrote:
> 
> > On Mon, Jun 19, 2023 at 6:29 PM kvaps <kvapss@gmail.com> wrote:
> > > Hi Kevin and the community,
> > >
> > > I am designing a CSI driver for Kubernetes that allows efficient
> > > utilization of SAN (Storage Area Network) and supports thin
> > > provisioning, snapshots, and ReadWriteMany mode for block devices.
> > >
> > > To implement this, I have explored several technologies such as
> > > traditional LVM, LVMThin (which does not support shared mode), and
> > > QCOW2 on top of block devices. This is the same approach that oVirt
> > > uses for thin provisioning over a shared LUN:
> > >
> > > https://github.com/oVirt/vdsm/blob/08a656c/doc/thin-provisioning.md
> > >
> > > Based on benchmark results, I found that the performance degradation
> > > of block-backed QCOW2 while creating snapshots is much lower than
> > > that of LVM and LVMThin.
> > >
> > >
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=2020746352
> > >
> > > Therefore, I have decided to use the same approach for Kubernetes.
> > >
> > > But in Kubernetes, the storage system needs to be self-sufficient and
> > > not dependent on the workload that uses it. Thus, unlike oVirt, we have
> > > no option to use the libvirt interface of the running VM to invoke the
> > > live migration. Instead, we should provide a pure block device in
> > > ReadWriteMany mode, where the block device can be writable on multiple
> > > hosts simultaneously.
> > >
> > > To achieve this, I decided to use the qemu-storage-daemon with the
> > > VDUSE backend.
> > >
> > > Other technologies, such as NBD and UBLK, were also considered, and
> > > their benchmark results can be seen in the same document on a
> > > different sheet:
> > >
> > >
> > https://docs.google.com/spreadsheets/d/1mppSKhEevGl5ntBhZT3ccU5t07LwxXjQz1HM2uvBIuo/edit#gid=416958126
> > >
> > > Taking into account the performance, stability, and versatility, I
> > > concluded that VDUSE is the optimal choice. To connect the device in
> > > Kubernetes, the virtio-vdpa interface would be used, and the entire
> > > scheme could look like this:
> > >
> > >
> > > +---------------------+  +---------------------+
> > > | node1               |  | node2               |
> > > |                     |  |                     |
> > > |    +-----------+    |  |    +-----------+    |
> > > |    | /dev/vda  |    |  |    | /dev/vda  |    |
> > > |    +-----+-----+    |  |    +-----+-----+    |
> > > |          |          |  |          |          |
> > > |     virtio-vdpa     |  |     virtio-vdpa     |
> > > |          |          |  |          |          |
> > > |        vduse        |  |        vduse        |
> > > |          |          |  |          |          |
> > > | qemu-storage-daemon |  | qemu-storage-daemon |
> > > |          |          |  |          |          |
> > > | +------- | -------+ |  | +------- | -------+ |
> > > | | LUN    |        | |  | | LUN    |        | |
> > > | |  +-----+-----+  | |  | |  +-----+-----+  | |
> > > | |  | LV (qcow2)|  | |  | |  | LV (qcow2)|  | |
> > > | |  +-----------+  | |  | |  +-----------+  | |
> > > | +--------+--------+ |  | +--------+--------+ |
> > > |          |          |  |          |          |
> > > |          |          |  |          |          |
> > > +--------- | ---------+  +--------- | ---------+
> > >            |                        |
> > >            |         +-----+        |
> > >            +---------| SAN |--------+
> > >                      +-----+
> > >
> > > Although two independent instances of qemu-storage-daemon for the same
> > > qcow2 disk run successfully on different hosts, I have concerns
> > > about whether they function properly. Similar to live migration, I
> > > think they should share state with each other.
> > >
> > > The question is how to make qemu-storage-daemon share the state
> > > between multiple nodes, or is the qcow2 format inherently stateless
> > > so that it does not require this?
> > >
> > > --
> > > Best Regards,
> > > Andrei Kvapil
> >
> > Hi Andrei,
> >
> > Apologies for not getting back to you sooner.
> >
> > Have you made progress on this?
> >
> > AIUI, and as others have mentioned, it's not possible to safely access
> > a qcow2 file from more than one qemu-storage-daemon (qsd) instance at
> > once. Disabling caching might help ensure consistency of the image's
> > data, but there would still be no synchronization between the qsd
> > instances when they are manipulating qcow2 metadata.
> >
> > ReadWriteMany block volumes are something that we would eventually
> > like to support in Subprovisioner [1], for instance so KubeVirt live
> > migration can work with it. The best we have come up with is to export
> > the volume from a single qsd instance over the network using NBD,
> > whenever more than one node has the volume mounted. This means that
> > all but one node would be accessing the volume with degraded
> > performance, but that may be acceptable for use cases like KubeVirt
> > live migration. We would then somehow migrate the qsd instance from
> > the source node to the destination node whenever the former unmounts
> > it, so that the migrated VM can access the volume with full
> > performance. This may require adding live migration support to qsd
> > itself.
> >
> > What are your thoughts on this approach?
> >
> > Thanks,
> > Alberto
> >
> > [1] https://gitlab.com/subprovisioner/subprovisioner
> >
> >



