

From: Denis Lunev
Subject: Re: [Qemu-devel] [RFC] Re-evaluating subcluster allocation for qcow2 images
Date: Thu, 27 Jun 2019 16:05:55 +0000

On 6/27/19 6:38 PM, Alberto Garcia wrote:
> On Thu 27 Jun 2019 04:19:25 PM CEST, Denis Lunev wrote:
>
>> Right now QCOW2 is not very efficient with the default cluster size (64k)
>> when you want fast performance with big disks. Nowadays people use really
>> BIG images, and 1-2-3-8 Tb disks are really common. Unfortunately people
>> also want fast random IO. Thus the metadata cache should fit in memory,
>> as otherwise IOPS are halved (one operation for the metadata read and
>> one operation for the real read). For an 8 Tb image this means 1 Gb of
>> RAM for that. With a 1 Mb cluster we get 64 Mb, which is much more
>> reasonable.
> Correct, the L2 metadata size is a well-known problem that has been
> discussed extensively, and that has received plenty of attention.
>
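Just to spell out the arithmetic behind those figures (assuming the usual
8-byte L2 entries, one per cluster):

  8 Tb / 64 Kb clusters = 128M clusters; 128M * 8 bytes = 1 Gb of L2 tables
  8 Tb /  1 Mb clusters =   8M clusters;   8M * 8 bytes = 64 Mb of L2 tables
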
>> Though with a 1 Mb cluster the reclaim process becomes much, much
>> worse. I cannot give exact numbers, unfortunately. AFAIR the image
>> occupies 30-50% more space. Guys, I would appreciate it if you could
>> correct me here with real numbers.
> Correct, because the cluster size is the smallest unit of allocation, so
> a 16KB write on an empty area of the image will always allocate a
> complete 1MB cluster.

>> Thus with respect to these patterns, subclusters could give us the
>> benefits of fast random IO and a good reclaim rate.
> Exactly, but that fast random I/O would only happen when allocating new
> clusters. Once the clusters are allocated it doesn't provide any
> additional performance benefit.

No, I am talking about the situation after the allocation. That is the main
reason why I have a feeling that sub-clusters could provide a benefit.

OK. Situation (1) is the following:
- the disk is completely allocated
- the QCOW2 image size is 8 Tb
- we have an image with a 1 Mb cluster/64k sub-cluster (for simplicity)
- the L2 metadata cache size is 128 Mb (64 Mb L2 tables, 64 Mb other data)
- holes are made on a sub-cluster basis, i.e. with 64 Kb granularity

In this case a random IO test will give near-native IOPS. The metadata is
in memory, so no additional reads are required. Wasted host filesystem
space (due to the cluster size) is kept to a minimum, i.e. at the level of
the "pre-subcluster" QCOW2.

Situation (2):
- 8 Tb QCOW2 image is completely allocated
- 1 Mb cluster size, 128 Mb L2 cache size

Nearly the same performance as (1), but much smaller disk space savings
from holes.
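(Here the plain 8-byte entries take only 8M * 8 bytes = 64 Mb, so the
whole L2 metadata fits in the 128 Mb cache with room to spare.)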

Situation (3):
- 8 Tb QCOW2 image, completely allocated
- 64 Kb cluster size, 128 MB L2 cache

Random IO performance is halved compared to (1) and (2) due to a metadata
re-read on almost every operation. Same disk space savings as in case (1).
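To make the halving explicit: with 64 Kb clusters the image needs 1 Gb of
L2 tables (see the arithmetic above), while a 128 Mb cache covers only

  128 Mb / 8 bytes = 16M entries, i.e. 16M * 64 Kb = 1 Tb

of the 8 Tb disk, so a random read almost always misses the L2 cache and
costs an extra metadata read on top of the data read.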

Please note, I am not talking here about your case with COW. Here the
allocation is performed on a sub-cluster basis, i.e. the absence of a
sub-cluster in the image means a hole at that offset. This is an
important difference.

>> I would consider a 64k cluster/8k subcluster as too extreme for me. In
>> reality we would end up with a completely fragmented image very soon.
> You mean because of the 64k cluster size, or because of the 8k
> subcluster size? If it's the former, yes. If it's the latter, it can be
> solved by preallocating the cluster with fallocate(). But then you would
> lose the benefit of the good reclaim rate.

You are optimizing COW speed, and your proposal is about that. Thus your
minimal allocation unit is a cluster. I am talking about a somewhat
different pattern of subcluster benefits, where the offset allocation unit
is the cluster while the space allocation unit is the sub-cluster.

This is an important difference, and that is why I am saying that for my
case an 8 Kb space allocation unit is too extreme. These cases should
somehow be treated separately.

Den
