[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [R
Re: [Qemu-devel] proposed qcow2 extension: cluster reservations [was: [RFC] Proposed qcow2 extension: subcluster allocation
Sat, 22 Apr 2017 19:56:57 +0200
Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.0
On 21.04.2017 23:09, Eric Blake wrote:
> On 04/06/2017 11:40 AM, Eric Blake wrote:
>>> === Changes to the on-disk format ===
>>> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
>>> indicating the allocation status of each subcluster. There are three
>>> possible states (unallocated, allocated, all zeroes), so we need two
>>> bits per subcluster.
>> You also have to add a new incompatible feature bit, so that older tools
>> know they can't read the new image correctly, and therefore don't
>> accidentally corrupt it.
> As long as we are talking about incompatible feature bits, I had some
> thoughts about image mapping today.
> tl;dr: summary> As long as we are considering incompatible features, maybe we
> make it easier to have an image file that explicitly preserves
> guest=>host mapping, such that nothing the guest can do will reorder the
> mapping. This way, it would be possible to fully preallocate an image
> such that all guest offsets are adjusted by a constant offset to become
> the corresponding host offset (basically, no qcow2 metadata is
> interleaved in the middle of guest data).
> I don't see any way to do it on current qcow2 images, but with
> subclusters, you get it for free by having a cluster with an offset but
> with all subclusters marked as unallocated. But perhaps it is something
> orthogonal enough that we would want a separate incompatible feature bit
> for just this, without having subclusters at the same time.
> In the process of exploring the topic, I expose a couple of existing
> bugs in our qcow2 handling.
> Longer version:
> If I'm reading qcow2-clusters.c and qcow2-refcount.c correctly, our
> current implementation of bdrv_discard() says that except for clusters
> already marked QCOW2_CLUSTER_ZERO, we will unconditionally remove the L2
> mapping of the address.
As I've said, I think the ZERO bit is just a bug and we should free
preallocated zero clusters.
> Whether we actually punch a hole in the
> underlying image, or merely add it to a list of free clusters for use in
> subsequent allocations, is later decided based on
> s->discard_passthrough[type] (that is, the caller can pass different
> type levels that control whether to never punch, always punch, or let
> the blockdev parameters of the caller control whether to punch).
> Presumably, if I've preallocated my image because I want to guarantee
> enough host space, then I would use blockdev parameters that ensure that
> guest actions never punch a hole. But based on my reading, I still have
> the situation that if I initially preallocated things so that guest
> cluster 0 and 1 are consecutive clusters in the host, and the guest
> triggers bdrv_pdiscard() over both clusters, then writes to cluster 1
> before cluster 0, then even though I'm not changing the underlying
> allocation of the host file, I _am_ resulting in fragmentation in the
> qcow2 file, where cluster 1 in the guest now falls prior to cluster 0.
> But if we can preserve mappings of clusters that are explicitly marked
> zero, I started to wonder if it would also be reasonable to have a mode
> where we could ALWAYS preserve mappings. Adding another bit that says
> that a cluster has a reserved mapping, but still defers to the backing
> file for its current data, would let us extend the existing behavior of
> map-preservation on write zeros to work with ALL writes, when writing to
> a fully pre-allocated image.
Yes, and it also means that we may want to think about implementation a
preallocation mode in qemu which puts all of the data into a single
consecutive chunk (as you have hinted at somewhere above).
> When I chatted with Max on IRC about the idea, we said this:
> <mreitz> I mean, sure, we can add both, but I'd still want them two be
> two incompatible bits
> <eblake> if you want the features to be orthogonal (with exponentially
> more cases to check), then having multiple incompatible bits is okay
> <mreitz> Because FEATURE_BIT_UNALLOCATED_AND_SUBCLUSTERS sounds weird
> and FEATURE_BIT_EXTENDED_L2_ENTRIES a bit pretentious
> <mreitz> Well, orthogonal is a good question. If we want to have an
> UNALLOCATED flag we should think so before adding subclusters
> <mreitz> Because then we should at least make clear that the ZERO flag
> for a subcluster requires the ALLOCATED flag to be set or something
> <mreitz> So we can reserve ZERO/!ALLOCATED for the case where you want
> to fall through to the backing file
> So, if you got this far in reading, the question becomes whether having
> a mode where you can mark a cluster as mapping-reserved-but-unallocated
> has enough use case to be worth pursuing, knowing that it will burn an
> incompatible feature bit; or if it should be rolled into the subcluster
> proposal, or whether it's not a feature that anyone needs after all.
I just forgot that just saying !ALLOCATED will be enough, regardless of
the ZERO flag... Yeah, subclusters will give us this for free, and I
think it's therefore reasonable to just require them if you want this
feature (preallocation with a backing file).
> And meanwhile, it looks like I have some patches to propose (and
> qemu-iotests to write) if I can help fix the bugs I've pointed out.
You mean these?
I'll send them once I've written tests.
Description: OpenPGP digital signature