From: He, Junyan
Subject: Re: [Qemu-block] [Qemu-devel] Some question about savevm/qcow2 incremental snapshot
Date: Mon, 28 May 2018 07:01:37 +0000

Hi Yang,

Alibaba made this proposal for NVDIMM snapshot optimization. 
Can you give some advice on this discussion?

Thanks



-----Original Message-----
From: Stefan Hajnoczi [mailto:address@hidden]
Sent: Monday, May 14, 2018 9:49 PM
To: Kevin Wolf <address@hidden>
Cc: Stefan Hajnoczi <address@hidden>; Pankaj Gupta <address@hidden>; He, Junyan 
<address@hidden>; address@hidden; qemu block <address@hidden>; Max Reitz 
<address@hidden>
Subject: Re: [Qemu-block] [Qemu-devel] Some question about savevm/qcow2 
incremental snapshot

On Fri, May 11, 2018 at 07:25:31PM +0200, Kevin Wolf wrote:
> Am 10.05.2018 um 10:26 hat Stefan Hajnoczi geschrieben:
> > On Wed, May 09, 2018 at 07:54:31PM +0200, Max Reitz wrote:
> > > On 2018-05-09 12:16, Stefan Hajnoczi wrote:
> > > > On Tue, May 08, 2018 at 05:03:09PM +0200, Kevin Wolf wrote:
> > > >> Am 08.05.2018 um 16:41 hat Eric Blake geschrieben:
> > > >>> On 12/25/2017 01:33 AM, He Junyan wrote:
> > > >> 2. Make the nvdimm device use the QEMU block layer so that it is backed
> > > >>    by a non-raw disk image (such as a qcow2 file representing the
> > > >>    content of the nvdimm) that supports snapshots.
> > > >>
> > > >>    This part is hard because it requires some completely new
> > > >>    infrastructure such as mapping clusters of the image file to guest
> > > >>    pages, and doing cluster allocation (including the copy on write
> > > >>    logic) by handling guest page faults.
> > > >>
> > > >> I think it makes sense to invest some effort into such 
> > > >> interfaces, but be prepared for a long journey.
> > > > 
> > > > I like the suggestion but it needs to be followed up with a 
> > > > concrete design that is feasible and fair for Junyan and others to 
> > > > implement.
> > > > Otherwise the "long journey" is really just a way of rejecting 
> > > > this feature.
> > > > 
> > > > Let's discuss the details of using the block layer for NVDIMM 
> > > > and try to come up with a plan.
> > > > 
> > > > The biggest issue with using the block layer is that persistent 
> > > > memory applications use load/store instructions to directly 
> > > > access data.  This is fundamentally different from the block 
> > > > layer, which transfers blocks of data to and from the device.
> > > > 
> > > > Because of block DMA, QEMU is able to perform processing at each 
> > > > block driver graph node.  This doesn't exist for persistent 
> > > > memory because software does not trap I/O.  Therefore the 
> > > > concept of filter nodes doesn't make sense for persistent memory 
> > > > - we certainly do not want to trap every I/O because performance would 
> > > > be terrible.
> > > > 
> > > > Another difference is that persistent memory I/O is synchronous.
> > > > Load/store instructions execute quickly.  Perhaps we could use 
> > > > KVM async page faults in cases where QEMU needs to perform 
> > > > processing, but again the performance would be bad.
> > > 
> > > Let me first say that I have no idea how the interface to NVDIMM looks.
> > > I just assume it works pretty much like normal RAM (so the 
> > > interface is just that it’s a part of the physical address space).
> > > 
> > > Also, it sounds a bit like you are already discarding my idea, but 
> > > here goes anyway.
> > > 
> > > Would it be possible to introduce a buffering block driver that 
> > > presents the guest an area of RAM/NVDIMM through an NVDIMM 
> > > interface (so I suppose as part of the guest address space)?  For 
> > > writing, we’d keep a dirty bitmap on it, and then we’d 
> > > asynchronously move the dirty areas through the block layer, so 
> > > basically like mirror.  On flushing, we’d block until everything is clean.
> > > 
> > > For reading, we’d follow a COR/stream model, basically, where 
> > > everything is unpopulated in the beginning and everything is 
> > > loaded through the block layer both asynchronously all the time 
> > > and on-demand whenever the guest needs something that has not been loaded 
> > > yet.
> > > 
> > > Now I notice that that looks pretty much like a backing file model 
> > > where we constantly run both a stream and a commit job at the same time.
> > > 
> > > The user could decide how much memory to use for the buffer, so it 
> > > could either hold everything or be partially unallocated.
> > > 
> > > You’d probably want to back the buffer by NVDIMM normally, so that 
> > > nothing is lost on crashes (though this would imply that for 
> > > partial allocation the buffering block driver would need to know 
> > > the mapping between the area in real NVDIMM and its virtual 
> > > representation).
> > > 
> > > Just my two cents while scanning through qemu-block to find emails 
> > > that don’t actually concern me...
> > 
> > The guest kernel already implements this - it's the page cache and 
> > the block layer!
> > 
> > Doing it in QEMU with dirty memory logging enabled is less efficient 
> > than doing it in the guest.
> > 
> > That's why I said it's better to just use block devices than to 
> > implement buffering.
> > 
> > I'm saying that persistent memory emulation on top of the iscsi:// 
> > block driver (for example) does not make sense.  It could be 
> > implemented but the performance wouldn't be better than block I/O 
> > and the complexity/code size in QEMU isn't justified IMO.
> 
> I think it could make sense if you put everything together.
> 
> The primary motivation to use this would of course be that you can 
> directly map the guest clusters of a qcow2 file into the guest. We'd 
> potentially fault on the first access, but once it's mapped, you get 
> raw speed. You're right about flushing, and I was indeed thinking of 
> Pankaj's work there; maybe I should have been more explicit about that.
> 
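
A rough sketch of what "directly mapping guest clusters" could look like
on first access, assuming a hypothetical qcow2_get_host_cluster() helper
that resolves a guest offset to the host file offset of an already
allocated cluster (allocation and copy-on-write are omitted):

#include <stdint.h>
#include <sys/mman.h>

#define CLUSTER_SIZE (64 * 1024)   /* qcow2 default cluster size */

/* Hypothetical: resolve a guest offset to the host file offset of its
 * already-allocated cluster via the L1/L2 tables. */
int64_t qcow2_get_host_cluster(void *bs, uint64_t guest_offset);

/* Handle a first-access fault on the guest-visible pmem region by
 * mapping the backing cluster of the image file in place. */
static int map_cluster_on_fault(void *bs, int image_fd,
                                uint8_t *guest_base, uint64_t fault_offset)
{
    uint64_t cluster_start = fault_offset & ~(uint64_t)(CLUSTER_SIZE - 1);
    int64_t host_offset = qcow2_get_host_cluster(bs, cluster_start);

    if (host_offset < 0) {
        return -1;  /* unallocated: would need allocation + COW here */
    }

    /* Replace the faulting range with a shared mapping of the image
     * file, so subsequent loads/stores run at raw speed. */
    void *p = mmap(guest_base + cluster_start, CLUSTER_SIZE,
                   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
                   image_fd, (off_t)host_offset);
    return p == MAP_FAILED ? -1 : 0;
}
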
> Now buffering in QEMU might come in useful when you want to run a 
> block job on the device. Block jobs are usually just temporary, and 
> accepting temporarily lower performance might be very acceptable when 
> the alternative is that you can't perform block jobs at all.

Why is buffering needed for block jobs?  They access the image using 
traditional block layer I/O requests.

> If we want to offer something nvdimm-like not only for the extreme 
> "performance only, no features" case, but as a viable option for the 
> average user, we need to be fast in the normal case and allow the use 
> of any block layer features without having to restart the VM with a 
> different storage device, even if at a performance penalty.

What are the details involved in making this possible?

Persistent memory does not trap I/O, but that is what filter drivers and 
before-write notifiers need.  So a page protection mechanism is required for 
the block layer to trap persistent memory accesses.
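
One conceivable userspace mechanism for such trapping (a sketch only; the
performance concerns above apply) is write-protecting the mapping and
catching the fault with mprotect(2) plus a SIGSEGV handler.  Here
notify_before_write() is a hypothetical hook, not an existing QEMU
interface:

#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL

static uint8_t *pmem_base;
static size_t pmem_size;

/* Hypothetical hook where a filter driver / before-write notifier
 * would run before the guest store is allowed to proceed. */
void notify_before_write(uint64_t offset);

static void wp_handler(int sig, siginfo_t *si, void *ctx)
{
    uint8_t *addr = si->si_addr;

    (void)sig;
    (void)ctx;
    if (addr < pmem_base || addr >= pmem_base + pmem_size) {
        signal(SIGSEGV, SIG_DFL);   /* not ours: crash on retry */
        return;
    }
    uint8_t *page = (uint8_t *)((uintptr_t)addr & ~(PAGE_SIZE - 1));
    notify_before_write((uint64_t)(page - pmem_base));
    /* Drop the protection so the faulting store can be retried. */
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
}

static void protect_pmem(uint8_t *base, size_t size)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = wp_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    pmem_base = base;
    pmem_size = size;
    /* Trap the next write to any page in the region. */
    mprotect(base, size, PROT_READ);
}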

Next, this needs to be integrated with BdrvTrackedRequest and
req->serialising so that copy-on-read, block jobs, etc. work correctly
when traditional block I/O requests from block jobs and direct memory
access from the guest are taking place at the same time.
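
A pseudo-code sketch of the ordering this implies; every helper here is a
hypothetical stand-in for the real BdrvTrackedRequest machinery in
block/io.c, not its actual API:

/* Pseudo-code: how a guest page fault on a pmem-backed qcow2 device
 * might be ordered against in-flight block-layer requests. */

typedef struct FaultRequest {
    uint64_t offset;
    uint64_t bytes;
} FaultRequest;

/* Hypothetical helpers */
void tracked_request_register(void *bs, FaultRequest *req);
void tracked_request_unregister(void *bs, FaultRequest *req);
void wait_for_overlapping_serialising_requests(void *bs, FaultRequest *req);
int  allocate_and_map_cluster(void *bs, uint64_t offset);

static int handle_pmem_fault(void *bs, uint64_t offset, uint64_t bytes)
{
    FaultRequest req = { .offset = offset, .bytes = bytes };
    int ret;

    /* Make the fault visible to copy-on-read, block jobs, etc. */
    tracked_request_register(bs, &req);

    /* Block until any overlapping serialising request (e.g. from a
     * running block job) has completed. */
    wait_for_overlapping_serialising_requests(bs, &req);

    /* Now it is safe to allocate the cluster and map it into the
     * guest (copy-on-write handled inside). */
    ret = allocate_and_map_cluster(bs, offset);

    tracked_request_unregister(bs, &req);
    return ret;
}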

Page protection is only realistic with KVM async page faults; otherwise faults 
freeze the vcpu until they are resolved.  kvm.ko needs to return the page fault 
information to QEMU, and QEMU must be able to resolve the async page fault once 
the page has been mapped.  Perhaps userfaultfd(2) can be used for this.
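
For reference, a minimal userfaultfd(2) flow that could deliver such faults
to QEMU; error handling is omitted and the source of the page contents
(e.g. data read from a qcow2 cluster) is left as a caller-provided buffer:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Register a guest memory region so that missing-page faults are
 * delivered to userspace instead of being resolved by the kernel. */
static int pmem_uffd_register(void *base, size_t len)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)base, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);
    return uffd;
}

/* Handle one fault: read the faulting address, fill the page and wake
 * the faulting thread. */
static void pmem_uffd_handle_one(int uffd, const void *src_page,
                                 size_t page_size)
{
    struct uffd_msg msg;

    if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
        msg.event != UFFD_EVENT_PAGEFAULT) {
        return;
    }

    struct uffdio_copy copy = {
        .dst = msg.arg.pagefault.address & ~(unsigned long)(page_size - 1),
        .src = (unsigned long)src_page,
        .len = page_size,
    };
    ioctl(uffd, UFFDIO_COPY, &copy);   /* UFFDIO_COPY also wakes the thread */
}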

This is as far as I've gotten before thinking about how buffering would work.

> On iscsi, you still don't gain anything compared to just using a block 
> device, but support for that might just happen as a side effect when 
> you implement the interesting features.

If we get the feature for free as part of addressing another use case, I won't 
complain :).

Stefan
