Re: [Qemu-devel] Block Filters


From: Fam Zheng
Subject: Re: [Qemu-devel] Block Filters
Date: Fri, 6 Sep 2013 17:18:20 +0800
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, 09/06 10:45, Kevin Wolf wrote:
> On 06.09.2013 at 09:56, Fam Zheng wrote:
> > On Tue, 09/03 18:24, Benoît Canet wrote:
> > > 
> > > Hello list,
> > > 
> > > I am thinking about QEMU block filters lately.
> > > 
> > > I am not a block.c/blockdev.c expert so tell me what you think of the 
> > > following.
> > > 
> > > The use cases I see would be:
> > > 
> > > -$user wants to have some real cryptography on top of qcow2/qed or another
> > > format.
> > >  Snapshots and other block features should continue to work.
> > >
> > > -$user wants to use a RAID-like feature such as QUORUM in QEMU.
> > >  Other features should continue to work.
> > >
> > > -$user wants to use the future SSD deduplication implementation with
> > > metadata on SSD and data on spinning disks.
> > >  Other features should continue to work.
> > >
> > > -$user wants to I/O throttle one drive of their VM.
> > >
> > > -$user wants to do Copy On Read.
> > >
> > > -$user wants to do a combination of the above.
> > >
> > > -$developer wants to take the minimum of required steps to keep changes
> > > small.
> > >
> > > -$developer wants to keep user interface changes for later.
> > > 
> > > Let's take the example of a user wanting to do I/O throttled, encrypted
> > > QUORUM on top of QCOW2.
> > >
> > > Assuming we want to implement throttling and encryption as something
> > > remotely like a block filter, this makes a pretty complex BlockDriverState
> > > tree.
> > > 
> > > The tree would look like the following:
> > > 
> > >                     I/O throttling BlockDriverState (bs)
> > >                                |
> > >                                |
> > >                                |
> > >                                |
> > >                     Encryption BlockDriverState (bs)
> > >                                |
> > >                                |
> > >                                |
> > >                                |
> > >                     Quorum BlockDriverState (bs)
> > >                    /           |           \
> > >                   /            |            \
> > >                  /             |             \
> > >                 /              |              \
> > >             QCOW2 bs       QCOW2 bs         QCOW2 bs
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >             RAW bs         RAW bs           RAW bs
> > > 
> > > An external snapshot should result in a tree like the following.
> > >                     I/O throttling BlockDriverState (bs)
> > >                                |
> > >                                |
> > >                                |
> > >                                |
> > >                     Encryption BlockDriverState (bs)
> > >                                |
> > >                                |
> > >                                |
> > >                                |
> > >                     Quorum BlockDriverState (bs)
> > >                    /           |           \
> > >                   /            |            \
> > >                  /             |             \
> > >                 /              |              \
> > >             QCOW2 bs       QCOW2 bs         QCOW2 bs
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >             QCOW2 bs       QCOW2 bs         QCOW2 bs
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >                |               |               |
> > >             RAW bs         RAW bs           RAW bs
> > > 
> > > In the current state of QEMU we can code some block drivers to implement
> > > this tree.
> > >
> > > However, when doing operations like snapshots, blockdev.c would have no
> > > real idea of what should be snapshotted and how. (The 3 top bs should be
> > > kept on top.)
> > >
> > > Moreover, it would have no easy way to manipulate this tree of
> > > BlockDriverStates, as each one is encapsulated in its parent.
> > >
> > > Also, there is no generic way to tell the block layer that two or more
> > > BlockDriverStates are siblings.
> > >
> > > This mail proposes some additional structures in order to cope with
> > > these problems.
> > > 
> > > The overall strategy of the proposed structures is to move the
> > > relationships between BlockDriverStates out of each BlockDriverState.
> > >
> > > The idea is that it would be easier for the block layer to manipulate a
> > > well-known structure instead of being forced to deal with the specifics
> > > of each BlockDriverState.
> > > 
> > > The first structure is the BlockStackNode.
> > > 
> > > The BlockStackNode would be used to represent the relationship between the
> > > various BlockDriverStates.
> > > 
> > > struct BlockStackNode {
> > >     BlockDriverState *bs;  /* the BlockDriverState held by this node */
> > >
> > >     /* this doubly linked list entry points to the child node and the
> > >      * parent node
> > >      */
> > >     QLIST_ENTRY(BlockStackNode) down;
> > >
> > >     /* this doubly linked list entry points to the siblings of this node
> > >      */
> > >     QLIST_ENTRY(BlockStackNode) siblings;
> > >
> > >     /* a hash or an array of the siblings of this node for fast access;
> > >      * should be recomputed when updating the tree (pseudocode) */
> > >     QHASH_ENTRY<BlockStackNode, index> siblings_hash;
> > > };
> > > 
> > > The BlockBackend would be the structure used to hold the "drive" the
> > > guest uses.
> > >
> > > struct BlockBackend {
> > >     /* the following doubly linked list head points to the top
> > >      * BlockStackNode; in our case it's the one containing the I/O
> > >      * throttling bs
> > >      */
> > >     QLIST_HEAD(, BlockStackNode) block_stack_head;
> > >     /* this is a pointer to the topmost node below the block filter chain;
> > >      * in our case the first QCOW2 sibling
> > >      */
> > >     BlockStackNode *top_node_below_filter;
> > > };
> > > 
> > > 
> > > Updated diagram:
> > > 
> > > (Here bsn means BlockStackNode)
> > > 
> > >     ------------------------BlockBackend
> > >     |                             |
> > >     |                          block_stack_head
> > >     |                             |
> > >     |                             |
> > >     |                       I/O throttling BlockStackNode (contains its bs)
> > >     |                             |
> > >     |                            down
> > >     |                             |
> > >     |                             |
> > > top_node_below_filter     Encryption BlockStackNode (contains its bs)
> > >     |                             |
> > >     |                            down
> > >     |                             |
> > >     |                             |
> > >     |                Quorum BlockStackNode (contains its bs)
> > >     |               /
> > >     |             down
> > >     |             /
> > >     |            /     S              S
> > >     ------  QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn (each bsn contains a bs)
> > >                |       b       |      b         |
> > >              down      l      down    l        down
> > >                |       i       |      i         |
> > >                |       n       |      n         |
> > >                |       g       |      g         |
> > >                |       s       |      s         |
> > >                |               |                |
> > >             RAW bsn         RAW bsn           RAW bsn  (each bsn contains a bs)
> > > 
> > > 
> > > Block driver point of view:
> > > 
> > > To construct the tree, each BlockDriver would have some utility functions
> > > looking like:
> > >
> > > bdrv_register_child_bs(bs, child_bs, int index);
> > >
> > > Multiple calls to this function could be made to register multiple
> > > sibling children, identified by their index.
> > >
> > > This way something like quorum could register multiple QCOW2 instances.
> > >
> > > The driver would have a
> > > BlockDriverState *bdrv_access_child(bs, int index);
> > >
> > > to access its children.
> > >
> > > These functions can be implemented without the driver knowing about
> > > BlockStackNodes, using container_of.
> > >
> > > blockdev point of view: (here I need your help)
> > >
> > > When doing a snapshot, blockdev.c would access
> > > BlockBackend->top_node_below_filter and make a snapshot of the bs
> > > contained in this node and its siblings.
> > > 
> > Since BlockDriver.bdrv_snapshot_create() is an optional operation,
> > blockdev.c can navigate down the tree from the top node until it hits a
> > layer where the op is implemented (the QCOW2 bs), so we can get rid of this
> > top_node_below_filter pointer.
> 
> Is it even inherent to a block driver (like a filter), if a snapshot is
> to be taken at its level? Or is it rather a policy decision that should
> be made by the user?
> 
OK, I get the point: the user should have full flexibility and fine-grained
control over operations. That also argues against
BlockBackend->top_node_below_filter. Do we really assume that all the filters
sit at the top of the tree, in a linear chain?
Shouldn't this be possible?

                   Block Backend
                         |
                         |
                    Quorum BDS
                    /    |    \
             iot filter  |     \
                  /      |      \
                qcow2   qcow2   qcow2

So we throttle only a particular image, not the whole device. But this will
make a top_node_below_filter pointer impossible.
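
To make that concrete, here is a minimal sketch in C. It uses a hypothetical
Node type and a hypothetical can_snapshot flag instead of the real
BlockDriverState and BlockDriver ops, so none of these names exist in QEMU.
It walks down from the root and stops, on each path, at the first node that
can take a snapshot. For the graph above the walk finds three such nodes, one
of them below the iot filter, so no single pointer can describe them.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for a BlockDriverState; not QEMU's real struct. */
typedef struct Node {
    const char *name;
    bool can_snapshot;                  /* assumed: driver implements a snapshot op */
    struct Node *children[MAX_CHILDREN];
    size_t nb_children;
} Node;

/*
 * Walk down from 'node' and store, for every path, the first node that
 * can snapshot. Returns the number of nodes stored in 'out'.
 */
static size_t collect_snapshot_points(Node *node, Node **out, size_t max_out)
{
    size_t found = 0;

    if (node == NULL || max_out == 0) {
        return 0;
    }
    if (node->can_snapshot) {
        out[0] = node;
        return 1;
    }
    for (size_t i = 0; i < node->nb_children && found < max_out; i++) {
        found += collect_snapshot_points(node->children[i],
                                         out + found, max_out - found);
    }
    return found;
}

/* The graph from this mail: only the first qcow2 image is throttled. */
static Node qcow2_a = { "qcow2-a", true, { NULL }, 0 };
static Node qcow2_b = { "qcow2-b", true, { NULL }, 0 };
static Node qcow2_c = { "qcow2-c", true, { NULL }, 0 };
static Node iot_filter = { "iot-filter", false, { &qcow2_a }, 1 };
static Node quorum = { "quorum", false,
                       { &iot_filter, &qcow2_b, &qcow2_c }, 3 };
```

Calling collect_snapshot_points(&quorum, ...) on this graph is only meant to
show the shape of the problem; which of the collected nodes actually get
snapshotted is still a policy question.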

> In our example, the quorum driver, it's not at all clear to me that you
> want to snapshot all children. In order to roll back to a previous
> state, one snapshot is enough, you don't need multiple copies of the
> same one. Perhaps you want two so that we can still compare them for
> verification. Or all of them because you can afford the disk space and
> want ultimate safety. I don't think qemu can know which one is true.
> 
Quorum would need special handling only if it knew about and operated on
snapshots itself, but it doesn't. So the general design has to cover this:
let the user take a snapshot of, or set throttle limits on, a particular BDS,
as in the graph above.
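
A rough sketch of what operating on a particular BDS could look like, assuming
each node carries a user-visible name. BdsNode, find_node() and set_throttle()
are hypothetical names for illustration only, not existing QEMU functions, and
the bps_limit field just stands in for whatever state a throttle filter keeps.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical named node; not QEMU's BlockDriverState. */
typedef struct BdsNode {
    const char *name;
    uint64_t bps_limit;                 /* assumed: 0 means unthrottled */
    struct BdsNode *children[3];
    size_t nb_children;
} BdsNode;

/* Depth-first search for the node called 'name'; NULL if there is none. */
static BdsNode *find_node(BdsNode *root, const char *name)
{
    if (root == NULL) {
        return NULL;
    }
    if (strcmp(root->name, name) == 0) {
        return root;
    }
    for (size_t i = 0; i < root->nb_children; i++) {
        BdsNode *hit = find_node(root->children[i], name);
        if (hit != NULL) {
            return hit;
        }
    }
    return NULL;
}

/* Set a limit on exactly one node. Returns 0 on success, -1 if not found. */
static int set_throttle(BdsNode *root, const char *name, uint64_t bps)
{
    BdsNode *node = find_node(root, name);

    if (node == NULL) {
        return -1;
    }
    node->bps_limit = bps;
    return 0;
}

/* Three images below a quorum-like parent. */
static BdsNode img0 = { "img0", 0, { NULL }, 0 };
static BdsNode img1 = { "img1", 0, { NULL }, 0 };
static BdsNode img2 = { "img2", 0, { NULL }, 0 };
static BdsNode top = { "quorum", 0, { &img0, &img1, &img2 }, 3 };
```

With something like set_throttle(&top, "img1", limit) only img1 is limited and
its siblings stay untouched, which is the granularity I mean.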

> In the same way, in a typical case you may want to keep I/O throttling
> for the whole drive, including the new snapshot. But what if the
> throttling was used in order to not overload the network where the image
> is stored, and you're now doing a local snapshot, to which you want to
> stream the image? The I/O throttling should apply only to the backing
> file, not the new snapshot.
> 
Yes, and OTOH, throttling only really makes sense as a filter if it can sit
somewhere other than the top; otherwise it's no better than what we have now.

> So perhaps what we really need is a more flexible snapshot/BDS tree
> manipulation command that describes in detail which structure you want
> to have in the end.
> 

Fam


