Re: [Qemu-block] [RFC v2] new, node-graph-based fleecing and backup

From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-block] [RFC v2] new, node-graph-based fleecing and backup
Date: Mon, 20 Aug 2018 12:42:34 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0

18.08.2018 00:50, Max Reitz wrote:
On 2018-08-14 19:01, Vladimir Sementsov-Ogievskiy wrote:
Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>

[v2 is just a resend. I forgot to add Den and me to cc, and I don't see the
letter in my Thunderbird at all. Strange. Sorry for that.]

Hi all!

Here is an idea and kind of proof-of-concept of how to unify and improve
push/pull backup schemes.

Let's start with fleecing, a way of exporting a point-in-time snapshot without
creating a real snapshot. Currently we do it with the help of backup(sync=none).


For fleecing we need two nodes:

1. fleecing hook. It's a filter which should be inserted on top of the active
disk. Its main purpose is handling guest writes with a copy-on-write operation,
i.e. it's a substitute for the write notifier in the backup job.

2. fleecing cache. It's a target node for COW operations by fleecing-hook.
It also represents a point-in-time snapshot of active disk for the readers.
It's not really COW, it's copy-before-write, isn't it?  It's something
else entirely.  COW is about writing data to an overlay *instead* of
writing it to the backing file.  Ideally, you don't copy anything,
actually.  It's just a side effect that you need to copy things if your
cluster size doesn't happen to match exactly what you're overwriting.

Hmm, I'm not against it. But the term COW is already used in backup to describe this.

CBW is about copying everything to the overlay, and then leaving it
alone, instead writing the data to the backing file.

I'm not sure how important it is, I just wanted to make a note so we
don't misunderstand what's going on, somehow.
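
To make the distinction concrete, here is a minimal sketch of copy-before-write as described above. It is a toy model only: clusters are list slots, and the class and method names are hypothetical, not part of QEMU or of the RFC patch.

```python
# Toy model: a "disk" is a Python list, one entry per cluster.
class CopyBeforeWriteFilter:
    """Hypothetical stand-in for the fleecing hook (not QEMU code)."""

    def __init__(self, active_disk):
        self.active = active_disk   # data the guest sees and modifies
        self.cache = {}             # cluster index -> old data (fleecing cache)

    def guest_write(self, cluster, data):
        # Copy-before-write: save the old contents once, then let the
        # new data go to the active disk itself (not to an overlay).
        if cluster not in self.cache:
            self.cache[cluster] = self.active[cluster]
        self.active[cluster] = data

    def snapshot_read(self, cluster):
        # Point-in-time view: the saved copy if the cluster was overwritten,
        # otherwise the unchanged data on the active disk.
        return self.cache.get(cluster, self.active[cluster])


disk = ["a0", "b0", "c0"]
hook = CopyBeforeWriteFilter(disk)
hook.guest_write(1, "b1")
hook.guest_write(1, "b2")   # second write must not overwrite the saved copy
snapshot = [hook.snapshot_read(i) for i in range(len(disk))]
```

Note that the new data always lands on the active disk and only old data is moved aside, which is the opposite of a COW overlay.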

The fleecing hook sounds good to me, but I'm asking myself why we don't
just add that behavior to the backup filter node.  That is, re-implement
backup without before-write notifiers by making the filter node actually
do something (I think there was some reason, but I don't remember).

Fleecing doesn't need any block job at all, so I think it is good to keep the fleecing filter
separate. It should then be reused by internal backup.

Hm, we could call this backup-filter instead of fleecing-hook; what is the difference?

The simplest implementation of the fleecing cache is a temporary qcow2 image, backed
by the active disk, i.e.:

    | Guest |
    +---+-----------+  file     +-----------------------+
    | Fleecing hook +---------->+ Fleecing cache(qcow2) |
    +---+-----------+           +---+-------------------+
        |                           |
backing |                           |
        v                           |
    +---+---------+      backing    |
    | Active disk +<----------------+

Hm. No, because of permissions I can't do that; I have to do it like this:

    | Guest |
    +---+-----------+  file     +-----------------------+
    | Fleecing hook +---------->+ Fleecing cache(qcow2) |
    +---+-----------+           +-----+-----------------+
        |                             |
backing |                             | backing
        v                             v
    +---+---------+   backing   +-----+---------------------+
    | Active disk +<------------+ hack children permissions |
    +-------------+             |     filter node           |

Ok, this works, it's an image fleecing scheme without any block jobs.
So this is the goal?  Hm.  How useful is that really?

I suppose technically you could allow blockdev-add'ing a backup filter
node (though only with sync=none) and that would give you the same.

What is a backup filter node?

Problems with the implementation:

1. What to do with the hack-permissions node? What is the proper way to implement
something like this? How to tune permissions to avoid this additional node?
Hm, how is that different from what we currently do?  Because the block
job takes care of it?

1. As I understand it, we agreed that it is good to use a filter node instead of write_notifier.
2. We already have a fleecing scheme in which we have to create a subgraph between nodes.
3. If we move to a filter node instead of write_notifier, a block job is not actually needed for fleecing, so it is good to drop it from the fleecing scheme to make it simpler, clearer and more transparent.
And finally, we will have a unified filter-node-based scheme for backup and fleecing, modular and customisable.

Well, the user would have to guarantee the permissions.  And they can
only do that by manually adding a filter node in the backing chain, I

Or they just start a block job which guarantees the permissions work...
So maybe it's best to just stay with a block job as it is.

2 Inserting/removing the filter. Do we have working way or developments on
Berto has posted patches for an x-blockdev-reopen QMP command.

3. Interesting: we can't set up the backing link to the active disk before inserting
the fleecing-hook, otherwise the insertion will damage this link. This means
that we can't create the fleecing cache node in advance, with its backing link, and
then reference it when creating the fleecing hook. Nor can we prepare all the nodes
in advance and then insert the filter. We have to:
1. create all the nodes with all links in one big json, or
I think that should be possible with x-blockdev-reopen.

2. set backing links/create nodes automatically, as it is done in this RFC
 (it's a bad way I think, not clear, not transparent)

4. Is it a good idea to use "backing" and "file" links in such a way?
I don't think so, because you're pretending it to be a COW relationship
when it isn't.  Using backing for what it is is kind of OK (because
that's what the mirror and backup filters do, too), but then using
"file" additionally is a bit weird.

(Usually, "backing" refers to a filtered node with COW, and "file" then
refers to the node where the overlay driver stores its data and
metadata.  But you'd store old data there (instead of new data), and no

Benefits, or, what can be done:

1. We can implement a special fleecing cache filter driver, which will be a real
cache: it will store some recently written clusters in RAM, and it can have a
backing (or file?) qcow2 child to flush some clusters to disk, etc. So,
for each cluster of the active disk we will have the following characteristics:

- changed (changed in active disk since backup start)
- copy (we need this cluster for the fleecing user. For example, in the RFC patch all
clusters are "copy": cow_bitmap is initialized to all ones. We can use some
existing bitmap to initialize cow_bitmap, and it will provide an "incremental"
fleecing, for use in incremental backup push or pull)
- cached in RAM
- cached in disk
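
The four per-cluster characteristics above could be tracked as a small set of flags. The following is only an illustrative sketch: the `ClusterState` type and both helper functions are hypothetical names, and the bitmap is modelled as a plain list of booleans rather than a QEMU dirty bitmap.

```python
from enum import Flag, auto


class ClusterState(Flag):
    """Hypothetical per-cluster flags, mirroring the list above."""
    CHANGED = auto()      # changed in the active disk since backup start
    COPY = auto()         # still needed by the fleecing user
    RAM_CACHED = auto()
    DISK_CACHED = auto()


def initial_states(n_clusters, incremental_bitmap=None):
    """Without a bitmap every cluster is COPY (cow_bitmap all ones);
    with a bitmap only the dirty clusters are, giving incremental fleecing."""
    if incremental_bitmap is None:
        return [ClusterState.COPY] * n_clusters
    return [ClusterState.COPY if dirty else ClusterState(0)
            for dirty in incremental_bitmap]


def can_drop_from_ram(state):
    """An unchanged RAM-cached cluster can be dropped from the RAM cache,
    because the active disk still holds the same data."""
    return (ClusterState.RAM_CACHED in state
            and ClusterState.CHANGED not in state)


full = initial_states(3)
incremental = initial_states(3, [True, False, True])
```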
Would it be possible to implement such a filter driver that could just
be used as a backup target?

For internal backup we need the backup job anyway, and we will be able to create different schemes.
One of my goals is a scheme where we store old data from CBW operations in a local cache while
the backup target is a remote, relatively slow NBD node. In this case, the cache is the backup source, not the target.

On top of these characteristics we can implement the following features:

1. COR, we can cache clusters not only on writes but on reads too, if we have
free space in ram-cache (and if not, do not cache at all, don't write to
disk-cache). It may be done like bdrv_write(..., BDRV_REQ_UNNECESARY)
You can do the same with backup by just putting a fast overlay between
source and the backup, if your source is so slow, and then do COR, i.e.:

slow source --> fast overlay --> COR node --> backup filter

How will we check ram-cache size to make COR optional in this scheme?
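
What "optional" COR means here can be sketched as a best-effort store that simply refuses when the RAM cache is full, instead of evicting or spilling to the disk cache. This is a toy model; `RamCache`, `try_store` and `read_with_optional_cor` are hypothetical names, not QEMU interfaces.

```python
class RamCache:
    """Hypothetical fixed-capacity RAM cache, counted in clusters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}

    def try_store(self, cluster, buf):
        # Best effort: store only if there is free space, never evict.
        if cluster in self.data:
            return True
        if len(self.data) >= self.capacity:
            return False
        self.data[cluster] = buf
        return True


def read_with_optional_cor(disk, cache, cluster):
    """Serve a read, caching the cluster on the way only if RAM allows."""
    if cluster in cache.data:
        return cache.data[cluster]
    buf = disk[cluster]
    cache.try_store(cluster, buf)   # result ignored: caching is optional
    return buf


disk = ["a", "b", "c"]
cache = RamCache(capacity=2)
results = [read_with_optional_cor(disk, cache, i) for i in range(3)]
```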

2. Benefit for guest: if cluster is unchanged and ram-cached, we can skip reading
from the device

3. If needed, we can drop unchanged ram-cached clusters from ram-cache

4. On guest write, if cluster is already cached, we just mark it "changed"

5. Lazy discards: in some setups, discards are not guaranteed to do something,
so, we can at least defer some discards to the end of backup, if ram-cache is

6. We can implement discard operation in fleecing cache, to make cluster
not needed (drop from cache, drop "copy" flag), so further reads of this
cluster will return error. So, fleecing client may read cluster by cluster
and discard them to reduce COW-load of the drive. We even can combine read
and discard into one command, something like "read-once", or it may be a
flag for fleecing-cache, that all reads are "read-once".
That would definitely be possible with a dedicated fleecing backup
target filter (and normal backup).

Target-filter schemes will not work for external backup.
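
The discard / "read-once" semantics from point 6 can be illustrated with a toy model. All names below are hypothetical; the only behaviour taken from the proposal is that a discarded cluster loses its "copy" flag, no longer triggers a copy on guest writes, and fails on later reads.

```python
class FleecingCache:
    """Hypothetical model of a fleecing cache with discard support."""

    def __init__(self, active_disk):
        self.active = active_disk
        self.needed = set(range(len(active_disk)))  # the per-cluster "copy" flag
        self.saved = {}                             # old data saved before writes

    def before_guest_write(self, cluster):
        # Only clusters the client still needs cost a copy operation.
        if cluster in self.needed and cluster not in self.saved:
            self.saved[cluster] = self.active[cluster]

    def read(self, cluster):
        if cluster not in self.needed:
            raise IOError(f"cluster {cluster} was discarded")
        return self.saved.get(cluster, self.active[cluster])

    def discard(self, cluster):
        self.needed.discard(cluster)
        self.saved.pop(cluster, None)

    def read_once(self, cluster):
        # Combined read + discard, as suggested in the proposal.
        data = self.read(cluster)
        self.discard(cluster)
        return data


disk = ["a0", "b0"]
fc = FleecingCache(disk)
first = fc.read_once(0)        # client copies cluster 0, then drops it
fc.before_guest_write(0)       # no copy needed any more
disk[0] = "a1"
fc.before_guest_write(1)       # cluster 1 is still needed: old data is saved
disk[1] = "b1"
```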

7. We can provide recommendations, on which clusters should fleecing-client
copy first. Examples:
a. copy ram-cached clusters first (obvious, to unload cache and reduce io
b. copy zero-clusters last (they don't occupy space in the cache, so let's copy
   other clusters first)
c. copy disk-cached clusters last (if we don't care about disk space,
   we can say, that for disk-cached clusters we already have a maximum
   io overhead, so let's copy other clusters first)
d. copy disk-cached clusters with high priority (but after ram-cached) -
   if we don't have enough disk space
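
One way to express such recommendations is a ranking function over cluster tags. This is a sketch under assumptions: the tag names, the `copy_order` helper and the exact relative order of "disk-cached" versus "zero" clusters are my own choices for illustration; options (c) and (d) are modelled as a single switch.

```python
def copy_order(clusters, disk_cached_last=True):
    """Return cluster indexes in recommended copy order.

    clusters: dict mapping cluster index -> set of tags taken from
    {"ram", "disk", "zero"} (hypothetical tag names).
    disk_cached_last=True models option (c), False models option (d).
    """
    def rank(item):
        index, tags = item
        if "ram" in tags:
            r = 0                                  # (a) ram-cached first
        elif "zero" in tags:
            r = 3                                  # (b) zero clusters last
        elif "disk" in tags:
            r = 2 if disk_cached_last else 1       # (c) late or (d) early
        else:
            r = 1 if disk_cached_last else 2       # ordinary uncached clusters
        return (r, index)

    return [index for index, _ in sorted(clusters.items(), key=rank)]


clusters = {0: {"zero"}, 1: {"disk"}, 2: set(), 3: {"ram"}}
plenty_of_disk = copy_order(clusters)
short_on_disk = copy_order(clusters, disk_cached_last=False)
```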

So, there is a wide range of possible policies. How to provide these
1. block_status
2. create separate interface
3. internal backup job may access shared fleecing object directly.
Hm, this is a completely different question now.  Sure, extending backup
or mirror (or a future blockdev-copy) would make it easiest for us.  But
then again, if you want to copy data off a point-in-time snapshot of a
volume, you can just use normal backup anyway, right?

Right. But how to implement all the features I listed? I see a way to implement them with the help of two special filters. And the backup job will be used anyway (without write notifiers) for internal backup, and will not be used for external backup (fleecing).

So I'd say the purpose of fleecing is that you have an external tool
make use of it.  Since my impression was that you'd just access the
volume externally and wouldn't actually copy all of the data off of it

Not quite right. People use fleecing to implement external backup, managed by their third-party tool, which they want to use instead of internal backup. And they do copy all the data. I can't describe all the reasons, but one example is custom storage for backups, which the external tool can manage and QEMU can't.
So, fleecing is used for external backups (or pull backups).

(because that's what you could use the backup job for), I don't think I
can say much here, because my impression seems to have been wrong.

About internal backup:
Of course, we need a job which will copy clusters. But it will be simplified:
So you want to completely rebuild backup based on the fact that you
specifically have fleecing now?

I need several features which are hard to implement using the current scheme.

1. A scheme where we have a local cache as the COW target and a slow remote backup target.
How to do it now? Using two backups, one with sync=none... I'm not sure that this is the right way.

2. Then, we'll need support for bitmaps in backup (sync=none).

3. Then, we'll need the possibility for backup(sync=none) to
not COW clusters which are already copied to the backup, and so on.

If we want a backup-filter anyway, why not implement some cool features on top of it?

I don't think that will be any simpler.

I mean, it would make blockdev-copy simpler, because we could
immediately replace backup by mirror, and then we just have mirror,
which would then automatically become blockdev-copy...

But it's not really going to be simpler, because whether you put the
copy-before-write logic into a dedicated block driver, or into the
backup filter driver, doesn't really make it simpler either way.  Well,
adding a new driver always is a bit more complicated, so there's that.

What is the difference between a separate filter driver and the backup filter driver?

it should not care about guest writes, it copies clusters from a kind of
snapshot which is not changing in time. This job should follow recommendations
from fleecing scheme [7].

What about the target?

We can use a separate node as the target, and copy from the fleecing cache to the target.
If we have only a ram-cache, it would be equal to the current approach (data is copied
directly to the target, even on COW). If we have both ram- and disk-caches, it's
a cool solution for a slow target: instead of making the guest wait for a long write to
the backup target (when the ram-cache is full) we can write to the disk-cache, which is local
and fast.
Or you backup to a fast overlay over a slow target, and run a live
commit on the side.

I think it will lead to larger I/O overhead: all clusters will go through the overlay, not only the guest-written clusters which we did not have time to copy.
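
The ram-cache plus disk-cache idea can be sketched as a two-tier store that only receives clusters from the copy-before-write path. This is a toy model with hypothetical names (`TwoTierCache`, `store`, `pop_for_backup`); real caches would of course hold buffers and a qcow2 child rather than dicts.

```python
class TwoTierCache:
    """Hypothetical fleecing cache: small RAM tier, local disk tier."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = {}    # fast, limited
        self.disk = {}   # local and fast compared to the remote target

    def store(self, cluster, old_data):
        """Called on copy-before-write; the guest never waits for the
        slow backup target, only for RAM or the local disk cache."""
        if len(self.ram) < self.ram_capacity:
            self.ram[cluster] = old_data
            return "ram"
        self.disk[cluster] = old_data
        return "disk"

    def pop_for_backup(self):
        """Hand one cached cluster to the backup job, RAM tier first."""
        tier = self.ram or self.disk
        if not tier:
            return None
        cluster = next(iter(tier))
        return cluster, tier.pop(cluster)


cache = TwoTierCache(ram_capacity=1)
tiers = [cache.store(0, "a0"), cache.store(1, "b0")]
drained = [cache.pop_for_backup() for _ in range(3)]
```

Only clusters overwritten by the guest before the job reached them pass through this cache; everything else is copied straight from the active disk.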

Another option is to combine fleecing cache and target somehow (I didn't think
about this really).

Finally, with one or two (three?) special filters we can implement all current
fleecing/backup schemes in a unified and very configurable way, and add a lot more
cool features and possibilities.

What do you think?
I think adding a specific fleecing target filter makes sense because you
gave many reasons for interesting new use cases that could emerge from that.

But I think adding a new fleecing-hook driver just means moving the
implementation from backup to that new driver.

But at the same time you say that it's OK to create a backup-filter (instead of write_notifier) and make it insertable via QAPI? So, if I implement it in block/backup, it's OK? Why not do it separately?


I really need help with creating/inserting/destroying the fleecing graph; my code
for it is a hack, I don't like it, it just works.

About testing: to show that this works I use the existing fleecing test, 222, slightly
tuned (drop the block job and use a new QMP command to remove the filter).


Best regards,
