Re: [Qemu-block] [RFC v2] new, node-graph-based fleecing and backup


From: Max Reitz
Subject: Re: [Qemu-block] [RFC v2] new, node-graph-based fleecing and backup
Date: Mon, 20 Aug 2018 19:25:10 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

On 2018-08-20 16:49, Vladimir Sementsov-Ogievskiy wrote:
> 20.08.2018 16:32, Max Reitz wrote:
>> On 2018-08-20 11:42, Vladimir Sementsov-Ogievskiy wrote:
>>> 18.08.2018 00:50, Max Reitz wrote:
>>>> On 2018-08-14 19:01, Vladimir Sementsov-Ogievskiy wrote:
>> [...]
>>
>>>>> Proposal:
>>>>>
>>>>> For fleecing we need two nodes:
>>>>>
>>>>> 1. fleecing hook. It's a filter which should be inserted on top of active
>>>>> disk. Its main purpose is handling guest writes by copy-on-write operation,
>>>>> i.e. it's a substitution for write-notifier in backup job.
>>>>>
>>>>> 2. fleecing cache. It's a target node for COW operations by fleecing-hook.
>>>>> It also represents a point-in-time snapshot of active disk for the readers.
>>>> It's not really COW, it's copy-before-write, isn't it?  It's something
>>>> else entirely.  COW is about writing data to an overlay *instead* of
>>>> writing it to the backing file.  Ideally, you don't copy anything,
>>>> actually.  It's just a side effect that you need to copy things if your
>>>> cluster size doesn't happen to match exactly what you're overwriting.
>>> Hmm. I'm not against. But COW term was already used in backup to
>>> describe this.
>> Bad enough. :-)
> 
> So, we agreed about new "CBW" abbreviation? :)

It is already used for the USB mass-storage command block wrapper, but I
suppose that is sufficiently different not to cause much confusion. :-)

(Or at least that's the only other use I know of.)
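
A toy Python model of the distinction discussed above, for illustration only. Nothing here is QEMU code; the class and method names are made up, and clusters are modelled as list entries that are always overwritten whole.

```python
class CowOverlay:
    """Copy-on-write: new data goes to the overlay instead of the backing file."""

    def __init__(self, backing):
        self.backing = backing   # list of cluster contents, never modified
        self.overlay = {}        # cluster index -> data written since creation

    def write(self, idx, data):
        # A whole-cluster write copies nothing; the backing file stays untouched.
        self.overlay[idx] = data

    def read(self, idx):
        return self.overlay.get(idx, self.backing[idx])


class CbwFilter:
    """Copy-before-write: save the old data first, then write to the active disk."""

    def __init__(self, active):
        self.active = active     # list of cluster contents, modified in place
        self.target = {}         # cluster index -> point-in-time data

    def write(self, idx, data):
        # Only the first overwrite of a cluster needs a copy.
        if idx not in self.target:
            self.target[idx] = self.active[idx]
        self.active[idx] = data

    def read_snapshot(self, idx):
        # Unmodified clusters are still read from the active disk.
        return self.target.get(idx, self.active[idx])
```

With COW the original image never changes and the overlay holds the new data; with CBW the active disk holds the new data and the target holds the old.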

[...]

>>> 2. We already have fleecing scheme, when we should create some subgraph
>>> between nodes.
>> Yes, but how do the permissions work right now, and why wouldn't they
>> work with your schema?
> 
> now it uses backup job, with shared_perm = all for its source and target
> nodes.

Uh-huh.

So the issue is...  Hm, what exactly?  The backup node probably doesn't
want to share WRITE for the source anymore, as there is no real point in
doing so.  And for the target, the only problem may be to share
CONSISTENT_READ.  It is OK to share that in the fleecing case, but in
other cases maybe it isn't.  But that's easy enough to distinguish in
the driver.

The main issue I could see is that the overlay (the fleecing target)
might not share write permissions on its backing file (the fleecing
source)...  But your diagram shows (and bdrv_format_default_perms() as
well) that this is not the case: when the overlay is writable, the
backing file may be written to, too.
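
A rough sketch of the rule being reasoned about here: every parent of a node requests some permissions and declares which ones it is willing to share with the other parents. The Python below is a simplified model, not the block layer's implementation; the flag names only mirror the permission names used in this thread, and `conflicts()` is a hypothetical helper.

```python
from enum import Flag, auto


class Perm(Flag):
    NONE = 0
    CONSISTENT_READ = auto()
    WRITE = auto()
    WRITE_UNCHANGED = auto()
    RESIZE = auto()


ALL = Perm.CONSISTENT_READ | Perm.WRITE | Perm.WRITE_UNCHANGED | Perm.RESIZE


def conflicts(parents):
    """Return pairs (a, b) where parent a uses a permission b does not share.

    `parents` maps a parent name to (perm, shared_perm) for one child node.
    """
    bad = []
    for a, (perm_a, _) in parents.items():
        for b, (_, shared_b) in parents.items():
            if a != b and perm_a & ~shared_b:
                bad.append((a, b))
    return bad


# The fleecing source has two parents: the guest device and the overlay
# (the fleecing target, which uses the source as its backing file).
parents = {
    "guest": (Perm.CONSISTENT_READ | Perm.WRITE, ALL),
    "overlay": (Perm.CONSISTENT_READ, ALL & ~Perm.WRITE),
}
print(conflicts(parents))   # the guest's WRITE is not shared by the overlay

parents["overlay"] = (Perm.CONSISTENT_READ, ALL)
print(conflicts(parents))   # no conflict once the overlay shares WRITE
```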

> (ha, you can look at the picture in "[PATCH v2 0/3] block nodes
> graph visualization")

:-)

>>> 3. If we move to filter-node instead of write_notifier, block job is not
>>> actually needed for fleecing, and it is good to drop it from the
>>> fleecing scheme, to simplify it, to make it more clear and transparent.
>> If that's possible, why not.  But again, I'm not sure whether that's
>> enough of a reason for the endeavour, because whether you start a block
>> job or do some graph manipulation yourself is not really a difference in
>> complexity.
> 
> not "or" but "and": in current fleecing scheme we do both graph
> manipulations and block-job start/cancel.

Hm!  Interesting.  I didn't know blockdev-backup didn't set the target's
backing file.  It makes sense, but I didn't think about it.

Well, still, my point was whether you do a blockdev-backup +
block-job-cancel, or a blockdev-add + blockdev-reopen + blockdev-reopen
+ blockdev-del...  If there is a difference, the former is going to be
simpler, probably.
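
For reference, the first variant could look roughly like the following pair of QMP commands, built here as JSON from Python. The node and job names ("drive0", "tmp0", "fleece0") are hypothetical, and the target overlay is assumed to have been added already; the `qmp()` helper is just a convenience for this sketch.

```python
import json


def qmp(execute, **arguments):
    """Build one QMP command as a JSON string ('_' in keys becomes '-')."""
    args = {k.replace("_", "-"): v for k, v in arguments.items()}
    return json.dumps({"execute": execute, "arguments": args})


# Start fleecing: with sync=none, only clusters about to be overwritten by
# the guest are copied to the target.
start = qmp("blockdev-backup", job_id="fleece0", device="drive0",
            target="tmp0", sync="none")

# Stop fleecing: cancel the job by its ID.
stop = qmp("block-job-cancel", device="fleece0")

print(start)
print(stop)
```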

(But if there are things you can't do with the current blockdev-backup,
then, well, that doesn't help you.)

> Yes, I agree that there is no real benefit in difficulty. I just think
> that if we have a filter node which performs "CBW" operations, block-job
> backup(sync=none) becomes actually empty, it will do nothing.

On the code side, yes, that's true.

>> But it's mostly your call, since I suppose you'd be doing most of the work.
>>
>>> And finally, we will have unified filter-node-based scheme for backup
>>> and fleecing, modular and customisable.
>> [...]
>>
>>>>> Benefits, or, what can be done:
>>>>>
>>>>> 1. We can implement special Fleecing cache filter driver, which will be a
>>>>> real cache: it will store some recently written clusters in RAM, it can
>>>>> have a backing (or file?) qcow2 child, to flush some clusters to the disk,
>>>>> etc. So, for each cluster of active disk we will have the following
>>>>> characteristics:
>>>>>
>>>>> - changed (changed in active disk since backup start)
>>>>> - copy (we need this cluster for fleecing user. For example, in RFC patch
>>>>>   all clusters are "copy", cow_bitmap is initialized to all ones. We can use
>>>>>   some existing bitmap to initialize cow_bitmap, and it will provide an
>>>>>   "incremental" fleecing (for use in incremental backup push or pull))
>>>>> - cached in RAM
>>>>> - cached on disk
>>>> Would it be possible to implement such a filter driver that could just
>>>> be used as a backup target?
>>> for internal backup we need backup-job anyway, and we will be able to
>>> create different schemes.
>>> One of my goals is the scheme, when we store old data from CBW
>>> operations into local cache, when
>>> backup target is remote, relatively slow NBD node. In this case, cache
>>> is backup source, not target.
>> Sorry, my question was badly worded.  My main point was whether you
>> could implement the filter driver in such a generic way that it wouldn't
>> depend on the fleecing-hook.
> 
> yes, I want my filter nodes to be self-sufficient entities. However it
> may be more effective to have some shared data, between them, for
> example, dirty-bitmaps, specifying drive clusters, to know which
> clusters are cached, which are changed, etc.

I suppose having global dirty bitmaps may make sense.
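
One way to picture such shared state is a per-cluster flag set that several filter nodes could consult. The sketch below is a toy model only: `ClusterState`, `SharedClusterMap` and `needs_cbw()` are invented names, and the four flags simply follow the characteristics listed in the proposal.

```python
from enum import Flag, auto


class ClusterState(Flag):
    NONE = 0
    CHANGED = auto()       # changed in the active disk since backup start
    COPY = auto()          # still needed by the fleecing user
    CACHED_RAM = auto()
    CACHED_DISK = auto()


class SharedClusterMap:
    """Per-cluster state shared between (hypothetical) filter nodes."""

    def __init__(self, nb_clusters, initial=ClusterState.COPY):
        # Starting with COPY everywhere mirrors a cow_bitmap of all ones.
        self.state = [initial] * nb_clusters

    def set(self, idx, flags):
        self.state[idx] |= flags

    def clear(self, idx, flags):
        self.state[idx] &= ~flags

    def needs_cbw(self, idx):
        # Old data must be saved before a guest write only if someone still
        # needs it and it has not been cached anywhere yet.
        s = self.state[idx]
        cached = ClusterState.CACHED_RAM | ClusterState.CACHED_DISK
        return bool(s & ClusterState.COPY) and not (s & cached)
```

Initializing the map from an existing bitmap instead of all-COPY would correspond to the "incremental" fleecing mentioned in the proposal.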

>> Judging from your answer and from the fact that you proposed calling the
>> filter node backup-filter and just using it for all backups, I suppose
>> the answer is "yes".  So that's good.
>>
>> (Though I didn't quite understand why in your example the cache would be
>> the backup source, when the target is the slow node...)
> 
> cache is a point-in-time view of active disk (actual source) for
> fleecing. So, we can start backup job to copy data from cache to target.

But wouldn't the cache need to be the immediate fleecing target for
this?  (And then you'd run another backup/mirror from it to copy the
whole disk to the real target.)

>>>>> On top of these characteristics we can implement the following features:
>>>>>
>>>>> 1. COR, we can cache clusters not only on writes but on reads too, if we
>>>>> have free space in ram-cache (and if not, do not cache at all, don't write
>>>>> to disk-cache). It may be done like bdrv_write(..., BDRV_REQ_UNNECESSARY)
>>>> You can do the same with backup by just putting a fast overlay between
>>>> source and the backup, if your source is so slow, and then do COR, i.e.:
>>>>
>>>> slow source --> fast overlay --> COR node --> backup filter
>>> How will we check ram-cache size to make COR optional in this scheme?
>> Yes, well, if you have a caching driver already, I suppose you can just
>> use that.
>>
>> You could either write it a bit simpler to only cache on writes and then
>> put a COR node on top if desired; or you implement the read cache
>> functionality directly in the node, which may make it a bit more
>> complicated, but probably also faster.
>>
>> (I guess you indeed want to go for faster when already writing a RAM
>> cache driver...)
>>
>> (I don't really understand what BDRV_REQ_UNNECESSARY is supposed to do,
>> though.)
> 
> When we do "CBW", we _must_ save data before guest write, so, we write
> this data to the cache (or directly to target, like in current approach).
> When we do "COR", we _may_ save data to our ram-cache. It's safe to not
> save data, as we can read it from active disk (data is not changed yet).
> BDRV_REQ_UNNECESSARY is a proposed interface to write this unnecessary
> data to the cache: if ram-cache is full, cache will skip this write.

Hm, OK...  But deciding for each request how much priority it should get
in a potential cache node seems like an awful lot of work.  Well, I
don't even know what kind of requests you would deem unnecessary.  If it
has something to do with the state of a dirty bitmap, then having global
dirty bitmaps might remove the need for such a request flag.
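
To illustrate the must-save versus may-save semantics described in the quoted text, here is a toy cache in Python. It is only a sketch of the proposed behaviour: `RamCache` and the `unnecessary` parameter are stand-ins for the cache node and the proposed BDRV_REQ_UNNECESSARY flag, and spilling mandatory writes to a disk cache is an assumption taken from the RAM plus disk cache idea earlier in the thread.

```python
class RamCache:
    """Toy cache: mandatory (CBW) writes always land, optional (COR) may be dropped."""

    def __init__(self, capacity):
        self.capacity = capacity   # number of clusters that fit in RAM
        self.ram = {}
        self.disk = {}

    def write(self, idx, data, unnecessary=False):
        """Store point-in-time data for a cluster; return True if it is cached."""
        if idx in self.ram or idx in self.disk:
            return True            # already have the point-in-time data
        if len(self.ram) < self.capacity:
            self.ram[idx] = data
            return True
        if unnecessary:
            return False           # RAM is full: skip, data is still on the active disk
        self.disk[idx] = data      # CBW data must be saved somewhere
        return True


cache = RamCache(capacity=1)
cache.write(0, "a", unnecessary=True)   # COR: cached, RAM has room
cache.write(1, "b", unnecessary=True)   # COR: skipped, RAM is full
cache.write(1, "b")                     # CBW: spilled to the disk cache
```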

[...]

>> Hm.  So what you want here is a special block driver or at least a
>> special interface that can give information to an outside tool, namely
>> the information you listed above.
>>
>> If you want information about RAM-cached clusters, well, you can only
>> get that information from the RAM cache driver.  It probably would be
>> allocation information, do we have any way of getting that out?
>>
>> It seems you can get all of that (zero information and allocation
>> information) over NBD.  Would that be enough?
> 
> it's a most generic and clean way, but I'm not sure that it will be
> performance-effective.

Intuitively I'd agree, but I suppose if NBD is written right, such a
request should be very fast and the response basically just consists of
the allocation information, so I don't suspect it can be much faster
than that.

(Unless you want some form of interrupts.  I suppose NBD would be the
wrong interface, then.)

[...]

>>> I need several features, which are hard to implement using current scheme.
>>>
>>> 1. The scheme when we have a local cache as COW target and slow remote
>>> backup target.
>>> How to do it now? Using two backups, one with sync=none... Not sure that
>>> this is right way.
>> If it works...
>>
>> (I'd rather build simple building blocks that you can put together than
>> something complicated that works for a specific solution)
> 
> exactly, I want to implement simple building blocks = filter nodes,
> instead of implementing all the features in backup job.

Good, good. :-)

>>> 3. Then, we'll need a possibility for backup(sync=none) to
>>> not COW clusters, which are already copied to backup, and so on.
>> Isn't that the same as 2?
> 
> We can use one bitmap for 2 and 3, and drop bits from it, when
> external-tool has read corresponding cluster from nbd-fleecing-export.

Oh, right, it needs to be modifiable from the outside.  I suppose that
would be possible in NBD, too.  (But I don't know exactly.)
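
A toy Python model of that idea: one bitmap decides whether a guest write needs a copy-before-write, and a read through the fleecing export clears the bit. The function names are hypothetical, and clearing the bit on every export read is the assumption being illustrated, not existing behaviour.

```python
def guest_write(active, cache, bitmap, idx, data):
    """Guest write: copy the old cluster first, but only if its bit is still set."""
    if bitmap[idx] and idx not in cache:
        cache[idx] = active[idx]
    active[idx] = data


def export_read(active, cache, bitmap, idx):
    """Read by the external tool: return point-in-time data, then drop the bit."""
    data = cache.get(idx, active[idx])
    bitmap[idx] = False    # the tool has this cluster, no need to preserve it
    return data


active = ["a", "b", "c"]
cache = {}
bitmap = [True] * len(active)   # all ones: every cluster is still needed

export_read(active, cache, bitmap, 1)        # tool reads cluster 1 early
guest_write(active, cache, bitmap, 1, "B")   # no copy needed any more
guest_write(active, cache, bitmap, 0, "A")   # cluster 0 is copied first
```

Note that once a bit is dropped, a second export read of the same cluster is no longer guaranteed to return point-in-time data, so the external tool has to read each cluster only once.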

[...]

>>>> I don't think that will be any simpler.
>>>>
>>>> I mean, it would make blockdev-copy simpler, because we could
>>>> immediately replace backup by mirror, and then we just have mirror,
>>>> which would then automatically become blockdev-copy...
>>>>
>>>> But it's not really going to be simpler, because whether you put the
>>>> copy-before-write logic into a dedicated block driver, or into the
>>>> backup filter driver, doesn't really make it simpler either way.  Well,
>>>> adding a new driver always is a bit more complicated, so there's that.
>>> what is the difference between separate filter driver and backup filter
>>> driver?
>> I thought we already had a backup filter node, so you wouldn't have had
>> to create a new driver in that case.
>>
>> But we don't, so there really is no difference.  Well, apart from being
>> able to share state easier when the driver is in the same file as the job.
> 
> But if we make it separate - it will be a separate "building block" to
> be reused in different schemes.

Absolutely true.

>>>>> it should not care about guest writes, it copies clusters from a kind of
>>>>> snapshot which is not changing in time. This job should follow
>>>>> recommendations from fleecing scheme [7].
>>>>>
>>>>> What about the target?
>>>>>
>>>>> We can use separate node as target, and copy from fleecing cache to the
>>>>> target. If we have only ram-cache, it would be equal to current approach
>>>>> (data is copied directly to the target, even on COW). If we have both ram-
>>>>> and disk-caches, it's a cool solution for slow-target: instead of making
>>>>> the guest wait for a long write to backup target (when ram-cache is full)
>>>>> we can write to disk-cache which is local and fast.
>>>> Or you backup to a fast overlay over a slow target, and run a live
>>>> commit on the side.
>>> I think it will lead to larger io overhead: all clusters will go through
>>> overlay, not only guest-written clusters, for which we did not have time
>>> to copy them.
>> Well, and it probably makes sense to have some form of RAM-cache driver.
>>  Then that'd be your fast overlay.
> 
> but there are no reasons to copy all the data through the cache: we need it
> only for CBW.

Well, if there'd be a RAM-cache driver, you may use it for anything that
seems useful (I seem to remember there were some patches on the list
like three or four years ago...).

> anyway, I think it will be good if both schemes will be possible.
> 
>>>>> Another option is to combine fleecing cache and target somehow (I didn't
>>>>> think about this really).
>>>>>
>>>>> Finally, with one - two (three?) special filters we can implement all
>>>>> current fleecing/backup schemes in unique and very configurable way and do
>>>>> a lot more cool features and possibilities.
>>>>>
>>>>> What do you think?
>>>> I think adding a specific fleecing target filter makes sense because you
>>>> gave many reasons for interesting new use cases that could emerge from that.
>>>>
>>>> But I think adding a new fleecing-hook driver just means moving the
>>>> implementation from backup to that new driver.
>>> But at the same time you say that it's ok to create backup-filter
>>> (instead of write_notifier) and make it insertable by qapi? So, if I
>>> implement it in block/backup, it's ok? Why not do it separately?
>> Because I thought we had it already.  But we don't.  So feel free to do
>> it separately. :-)
> 
> Ok, that's good :) . Then, I'll try to reuse the filter in backup
> instead of write-notifiers, and find out whether we really need the internal
> state of backup block-job or not.
> 
>> Max
>>
> 
> PS: in background, I have unpublished work, aimed to parallelize
> backup-job into several coroutines (like it is done for mirror, qemu-img
> clone cmd). And it's really hard. It creates queues of requests with
> different priority, to handle CBW requests in common pipeline, it's
> mostly a rewrite of block/backup. If we split CBW from backup to
> separate filter-node, backup becomes very simple thing (copy clusters
> from constant storage) and its parallelization becomes simpler.

If CBW is split from backup, maybe mirror could replace backup
immediately.  You'd fleece to a RAM cache target and then mirror from there.

(To be precise: The exact replacement would be an active mirror, so a
mirror with copy-mode=write-blocking, so it immediately writes the old
block to the target when it is changed in the source, and thus the RAM
cache could stay effectively empty.)

> I don't say throw the backup away, but I have several ideas, which may
> alter current approach. They may live in parallel with current backup
> path, or replace it in future, if they will be more effective.

Thing is, contrary to the impression I've probably given, we do want to
throw away backup sooner or later.  We want a single block job
(blockdev-copy) that unifies mirror, backup, and commit.

(mirror already basically supersedes commit, with live commit just being
exactly mirror; the main problem is integrating backup.  But with a
fleecing node and a RAM cache target, that would suddenly be really
simple, I assume.)

((All that's missing is sync=top, where the mirror would need to not
only check its source (which would be the RAM cache), but also its
backing file; and sync=incremental, which just isn't there with mirror
at all.  OTOH, it may be possible to implement both modes simply in the
fleecing/backup node, so it only copies that respective data to the
target and the mirror simply sees nothing else.))

Max
