20.08.2018 20:25, Max Reitz wrote:
On 2018-08-20 16:49, Vladimir Sementsov-Ogievskiy wrote:
20.08.2018 16:32, Max Reitz wrote:
On 2018-08-20 11:42, Vladimir Sementsov-Ogievskiy wrote:
18.08.2018 00:50, Max Reitz wrote:
On 2018-08-14 19:01, Vladimir Sementsov-Ogievskiy wrote:
For fleecing we need two nodes:
1. fleecing hook. It's a filter which should be inserted on top of the
active disk. Its main purpose is handling guest writes by a
copy-before-write operation, i.e. it's a substitution for the
write-notifier in the backup job.
2. fleecing cache. It's a target node for COW operations done by the
fleecing hook. It also represents a point-in-time snapshot of the
active disk for the fleecing user.
It's not really COW, it's copy-before-write, isn't it? It's something
else entirely. COW is about writing data to an overlay *instead* of
writing it to the backing file. Ideally, you don't copy anything,
actually. It's just a side effect that you need to copy things if your
cluster size doesn't happen to match exactly what you're writing.
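In code terms, the difference would look roughly like this (only a
sketch; the driver name, the state struct and the copy helper are made
up, and the callback signature is approximate):

    static coroutine_fn int fleecing_hook_co_pwritev(BlockDriverState *bs,
            uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
    {
        FleecingHookState *s = bs->opaque;
        int ret;

        /* Copy-before-write: save the old data to the fleecing target
         * before the guest overwrites it. */
        ret = fleecing_hook_copy_to_target(s, offset, bytes);
        if (ret < 0) {
            return ret;
        }

        /* Then let the guest write go through to the active disk. */
        return bdrv_co_pwritev(bs->file, offset, bytes, qiov, flags);
    }

With COW, by contrast, the guest write itself would go to the overlay,
and the backing file would never be written at all.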
Hmm. I'm not against. But the COW term was already used in backup to
describe exactly this operation.
Bad enough. :-)
So, we agreed about new "CBW" abbreviation? :)
It is already used for the USB mass-storage command block wrapper, but I
suppose that is sufficiently different not to cause much confusion. :-)
(Or at least that's the only other use I know of.)
2. We already have the fleecing scheme, where we have to create some
sub-graph of nodes anyway; right now it uses the backup job, with
shared_perm = all for its source and target.
Yes, but how do the permissions work right now, and why wouldn't they
work with your schema?
So the issue is... Hm, what exactly? The backup node probably doesn't
want to share WRITE for the source anymore, as there is no real point in
doing so. And for the target, the only problem may be to share
CONSISTENT_READ. It is OK to share that in the fleecing case, but in
other cases maybe it isn't. But that's easy enough to distinguish in
the backup code.
The main issue I could see is that the overlay (the fleecing target)
might not share write permissions on its backing file (the fleecing
source)... But your diagram shows (and bdrv_format_default_perms() as
well) that this is not the case: when the overlay is writable, the
backing file may be written to, too.
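(For reference, if the fleecing hook wanted to forbid that, its
.bdrv_child_perm could simply unshare WRITE on the source child; a
rough sketch, with the callback signature written from memory and the
driver name made up:

    static void fleecing_hook_child_perm(BlockDriverState *bs, BdrvChild *c,
            const BdrvChildRole *role, BlockReopenQueue *reopen_queue,
            uint64_t perm, uint64_t shared,
            uint64_t *nperm, uint64_t *nshared)
    {
        /* The hook itself reads the old data and passes guest writes
         * through, so it takes the parents' permissions plus READ. */
        *nperm = perm | BLK_PERM_CONSISTENT_READ;

        /* Don't share WRITE on the source with other parents, so
         * nobody can bypass the copy-before-write logic. */
        *nshared = shared & ~BLK_PERM_WRITE;
    }
)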
Hm, actually the overlay may share the write permission for clusters
which are already saved in the overlay, or which are not needed (if we
have a dirty bitmap for incremental backup).. But we don't have such a
permission kind, and it looks not easy to implement it... And it may be
too expensive at run time anyway.
(ha, you can look at the picture in the "[PATCH v2 0/3] block nodes
..." series)
3. If we move to a filter node instead of the write_notifier, the block
job is not actually needed for fleecing anymore, and it is good to drop
it from the fleecing scheme, to simplify it, to make it more clear.
If that's possible, why not. But again, I'm not sure whether that's
enough of a reason for the endeavour, because whether you start a block
job or do some graph manipulation yourself is not really a difference
in complexity.
not "or" but "and": in current fleecing scheme we do both graph
manipulations and block-job stat/cancel..
Hm! Interesting. I didn't know blockdev-backup didn't set the target's
backing file. It makes sense, but I didn't think about it.
Well, still, my point was whether you do a blockdev-backup +
block-job-cancel, or a blockdev-add + blockdev-reopen + blockdev-reopen
+ blockdev-del... If there is a difference, the former is going to be
simpler for the user.
(But if there are things you can't do with the current blockdev-backup,
then, well, that doesn't help you.)
Yes, I agree that there is no real benefit in difficulty. I just think
that if we have a filter node which performs the "CBW" operations, the
backup(sync=none) job becomes actually empty, it will do nothing.
On the code side, yes, that's true.
But it's mostly your call, since I suppose you'd be doing most of the
work.
And finally, we will have a unified filter-node-based scheme for backup
and fleecing, modular and customisable.
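Roughly, the graph I have in mind (an illustration only; the node names
are not final):

    guest
      |
      v
    fleecing-hook --- old data (CBW) ---> fleecing-cache <--- fleecing user
      |                                        |             (NBD export or
      v                                        | backing      backup job)
    active disk <------------------------------+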
Benefits, or, what can be done:

1. We can implement a special fleecing-cache filter driver, which will
be a real cache: it will store some recently written clusters in RAM,
it can have a backing (or file?) qcow2 child to flush some clusters to
the disk, etc. So, for each cluster of the active disk we will have the
following characteristics:

- changed (changed in the active disk since backup start)
- copy (we need this cluster for the fleecing user. For example, in the
RFC patch all clusters are "copy", cow_bitmap is initialized to all
ones. We can use some existing bitmap to initialize cow_bitmap, and it
will provide an incremental fleecing (for use in incremental backup
push or pull))
- cached in RAM
- cached on disk
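I imagine tracking these properties with one bitmap per state,
something like (just a sketch, all names are made up):

    typedef struct FleecingCacheState {
        BdrvDirtyBitmap *changed; /* changed in active disk since backup start */
        BdrvDirtyBitmap *copy;    /* still needed by the fleecing user */
        BdrvDirtyBitmap *in_ram;  /* old data currently held in the RAM cache */
        BdrvDirtyBitmap *on_disk; /* old data flushed to the qcow2 disk child */
    } FleecingCacheState;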
Would it be possible to implement such a filter driver so that it could
also be used as a backup target?
For internal backup we need the backup job anyway, and we will be able
to create different schemes.
One of my goals is the scheme where we store the old data from CBW
operations in a local cache, when the backup target is a remote,
relatively slow NBD node. In this case, the cache is the backup source,
not the target.
Sorry, my question was badly worded. My main point was whether you
could implement the filter driver in such a generic way that it doesn't
depend on the fleecing-hook.
Yes, I want my filter nodes to be self-sufficient entities. However, it
may be more efficient to have some shared data between them, for
example dirty bitmaps describing the disk clusters: which clusters are
cached, which are changed, etc.
I suppose having global dirty bitmaps may make sense.
Judging from your answer and from the fact that you proposed making the
filter node a generic backup-filter and just using it for all backups,
I suppose the answer is "yes". So that's good.
(Though I didn't quite understand why in your example the cache was
the backup source, when the target is the slow node...)
The cache is a point-in-time view of the active disk (the actual
source) for fleecing. So, we can start a backup job to copy data from
the cache to the slow target.
But wouldn't the cache need to be the immediate fleecing target for
this? (And then you'd run another backup/mirror from it to copy the
whole disk to the real target.)
Yes, the cache is immediate fleecing target.
Yes, well, if you have a caching driver already, I suppose you can use
it that way, too.
On top of these characteristics we can implement the following
features:
1. COR: we can cache clusters not only on writes but on reads too, if
we have free space in the ram-cache (and if not, do not cache at all,
don't involve the disk-cache). It may be done like bdrv_write(...,
BDRV_REQ_UNNECESSARY).
You can do the same with backup by just putting a fast overlay between
the source and the backup, if your source is so slow, and then do:

slow source --> fast overlay --> COR node --> backup filter
How will we check ram-cache size to make COR optional in this scheme?
You could either write it a bit simpler to only cache on writes and
put a COR node on top if desired; or you implement the read cache
functionality directly in the node, which may make it a bit more
complicated, but probably also faster.
(I guess you indeed want to go for faster when already writing a RAM
cache.)

(I don't really understand what BDRV_REQ_UNNECESSARY is supposed to
do.)
When we do "CBW", we _must_ save data before guest write, so, we write
this data to the cache (or directly to target, like in current
When we do "COR", we _may_ save data to our ram-cache. It's safe to not
save data, as we can read it from active disk (data is not changed
BDRV_REQ_UNNECESSARY is a proposed interface to write this unnecessary
data to the cache: if ram-cache is full, cache will skip this write.
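Roughly (a sketch only; the flag and all helper names are made up):

    static coroutine_fn int ram_cache_co_pwritev(BlockDriverState *bs,
            uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
    {
        RAMCacheState *s = bs->opaque;

        if ((flags & BDRV_REQ_UNNECESSARY) && !ram_cache_has_space(s, bytes)) {
            /* COR data: safe to drop, the reader can still get it from
             * the unchanged clusters of the active disk. */
            return 0;
        }

        /* CBW data (no flag) must be stored, evicting to the disk
         * cache if RAM is full. */
        return ram_cache_store(s, offset, bytes, qiov);
    }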
Hm, OK... But deciding for each request how much priority it should get
in a potential cache node seems like an awful lot of work. Well, I
don't even know what kind of requests you would deem unnecessary. If it
has something to do with the state of a dirty bitmap, then having global
dirty bitmaps might remove the need for such a request flag.
Yes, if we have some "shared fleecing object", accessible by
fleecing-cache filter (and backup job, if it is an internal backup),
we don't need
Hm. So what you want here is a special block driver or at least a
special interface that can give information to an outside tool, namely
the information you listed above.
If you want information about RAM-cached clusters, well, you can only
get that information from the RAM cache driver. It probably would be
best exposed like allocation information; do we have any way of getting
that out?
It seems you can get all of that (zero information and allocation
information) over NBD. Would that be enough?
It's the most generic and clean way, but I'm not sure that it will be
effective enough.
Intuitively I'd agree, but I suppose if NBD is written right, such a
request should be very fast and the response basically just consists of
the allocation information, so I don't suspect it can be made much
faster.
(Unless you want some form of interrupts. I suppose NBD would be the
wrong interface, then.)
Yes, for external backup through NBD it's OK to get block status, but
for internal backup it seems faster to access the shared fleecing
object (or global bitmaps, etc).

However, if we have some shared fleecing object, it's not a problem to
export it as block-status metadata through the NBD export..
I need several features which are hard to implement using the current
backup job:

1. The scheme where we have a local cache as the COW target and a slow
remote backup target. How to do it now? Using two backups, one with
sync=none... Not sure this is the right way.
If it works...
(I'd rather build simple building blocks that you can put together than
something complicated that works only for a specific solution.)
exactly, I want to implement simple building blocks = filter nodes,
instead of implementing all the features in backup job.
Good, good. :-)
3. We'll need a possibility for backup(sync=none) to not COW clusters
which are already copied to the backup, and so on.
Isn't that the same as 2?
We can use one bitmap for 2 and 3, and drop bits from it, when
external-tool has read corresponding cluster from nbd-fleecing-export..
Oh, right, it needs to be modifiable from the outside. I suppose that
would be possible in NBD, too. (But I don't know exactly.)
I think it's natural to implement it through a discard operation on the
fleecing-cache node: if the fleecing user discards something, it will
not read it anymore, so we can drop it from the cache and clear the bit
in the shared bitmap.

Then we can improve it by creating a READ_ONCE flag for each READ
command or for the whole connection, to discard the data after each
read.. Or pass this flag to bdrv_read, to handle it in one command..
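Something like this (a sketch; the helpers are made up, and the "copy"
bitmap is the one from the shared fleecing object above):

    static coroutine_fn int fleecing_cache_co_pdiscard(BlockDriverState *bs,
            int64_t offset, int bytes)
    {
        FleecingCacheState *s = bs->opaque;

        /* The fleecing user will not read this range again: drop the
         * cached data and clear the "copy" bit, so the hook stops
         * doing CBW for it. */
        ram_cache_drop(s, offset, bytes);
        bdrv_reset_dirty_bitmap(s->copy, offset, bytes);
        return 0;
    }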
I thought we already had a backup filter node, so you wouldn't have to
add a new driver.
What is the difference between a separate filter driver and a filter
implemented inside backup?
I don't think that will be any simpler.
I mean, it would make blockdev-copy simpler, because we could
immediately replace backup by mirror, and then we just have mirror,
which would then automatically become blockdev-copy...
But it's not really going to be simpler, because whether you put the
copy-before-write logic into a dedicated block driver, or into the
backup filter driver, doesn't really make it simpler either way.
If we already had a backup filter driver, then, since adding a new
driver always is a bit more complicated, there would be little reason
to create a new one. But we don't, so there really is no difference.
Well, apart from being able to share state more easily when the driver
is in the same file as the backup job.
But if we make it separate - it will be a separate "building block" to
be reused in different schemes.
Well, and it probably makes sense to have some form of RAM-cache block
driver anyway.
I think it will lead to larger I/O overhead: all clusters will go
through the overlay, not only the guest-written clusters for which we
did not have time to copy them yet..
The backup job itself should not care about guest writes; it copies
clusters from a snapshot which is not changing in time. This follows
naturally from the fleecing scheme.
What about the target?
We can use a separate node as the target, and copy from the fleecing
cache to the target. If we have only a ram-cache, it would be equal to
the current approach (data is copied directly to the target, even on
COW). If we have both ram- and disk-caches, it's a cool solution for a
slow target: instead of making the guest wait for a long write to the
backup target (when the ram-cache is full), we can write to the
disk-cache, which is local and fast.
Or you backup to a fast overlay over a slow target, and run a live
commit on the side.
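I.e., roughly:

active disk --(backup)--> fast overlay --(live commit, on the side)--> slow target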
Then that'd be your fast overlay.
But there are no reasons to copy all the data through the cache: we
need it only for CBW.
Well, if there'd be a RAM-cache driver, you may use it for anything that
seems useful (I seem to remember there were some patches on the list
like three or four years ago...).
Anyway, I think it will be good if both schemes are possible.
Another option is to combine the fleecing cache and the target somehow
(I didn't think about this really).

I think adding a specific fleecing target filter makes sense; you gave
many reasons for interesting new use cases that could emerge.

Finally, with one or two (three?) special filters we can implement all
current fleecing/backup schemes in a unique and very configurable way,
and do a lot more cool features and possibilities.

What do you think?

But I think adding a new fleecing-hook driver just means moving the
implementation from backup to that new driver.

But at the same time you say that it's OK to create a backup-filter
(instead of the write_notifier) and make it insertable by QAPI? So, if
I implement it in block/backup, it's OK? Why not do it separately?

Because I thought we had it already. But we don't. So feel free to do
it separately. :-)
Ok, that's good :). Then I'll try to reuse the filter in backup instead
of the write-notifiers, and figure out whether we really need the
internal state of the backup block-job or not.
PS: in the background, I have unpublished work aimed at parallelizing
the backup job into several coroutines (like it is done for mirror and
the clone cmd). And it's really hard. It creates queues of requests
with different priorities, to handle CBW requests in the common
pipeline; it's mostly a rewrite of block/backup. If we split CBW from
backup into a separate filter node, backup becomes a very simple thing
(copy clusters from constant storage) and its parallelization becomes
simpler.
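The remaining loop could be as simple as (pure sketch, all names made
up):

    static coroutine_fn void backup_loop(BackupJob *job)
    {
        int64_t offset;

        for (offset = 0; offset < job->len; offset += job->cluster_size) {
            if (!cluster_needs_copy(job, offset)) {
                continue; /* already copied, or not wanted by the user */
            }
            /* The source is a constant point-in-time snapshot, so no
             * interaction with guest writes is needed here, and many
             * such coroutines can run in parallel. */
            backup_copy_cluster(job, offset);
        }
    }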
If CBW is split from backup, maybe mirror could replace backup
immediately. You'd fleece to a RAM cache target and then mirror from
there.
Hmm, good option. It would be just one mirror iteration. But then I'll
need to teach mirror to copy clusters with some priorities, to avoid
ram-cache overloading (and guest I/O hangs). It may be better to have a
separate, simple (a lot simpler than mirror) block job for it, or to
use backup. Anyway, it's a separate building block; a performance
comparison will show the better candidate.
(To be precise: The exact replacement would be an active mirror, so a
mirror with copy-mode=write-blocking, so it immediately writes the old
block to the target when it is changed in the source, and thus the RAM
cache could stay effectively empty.)
Hmm, or this way. So, actually, for such a thing we need a cache node
which does absolutely nothing; the write will actually be handled by
the mirror job. But in this case we can't control the size of the
actual RAM cache: if the target is slow, we will accumulate unfinished
bdrv_mirror_top_pwritev calls, which have allocated memory and are
waiting in a queue.