From: Eric Blake
Subject: Re: [Qemu-block] [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images
Date: Wed, 21 Feb 2018 09:59:25 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0

On 02/21/2018 09:00 AM, Eric Blake wrote:
> On 02/21/2018 04:04 AM, Alberto Garcia wrote:
>> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>>
>> I was also preparing a patch to change this, but you arrived first :-)
>>
>>> So, it's time to cut back on the waste.  A compressed cluster
>>> will NEVER occupy more than an uncompressed cluster
>
> And here, csize can only get smaller than the length picked by
> nb_csectors.  Therefore, csize is GUARANTEED to be <= s->cluster_size.
>
>> - Solution a: check that csize < s->cluster_size and return an error if
>>   it's not. However! although QEMU won't produce an image with a
>>   compressed cluster that is larger than the uncompressed one, the qcow2
>>   on-disk format in principle allows for that, so arguably we should
>>   accept it.
>
> No, the qcow2 on-disk format does not have enough bits reserved for
> that.  It is impossible to store an inflated cluster, because you don't
> have enough bits to record it.

Okay, the spec is WRONG, compared to our code base.


That said, we MAY have a bug, more likely to be visible in 512-byte clusters but possible on any size.  While the 'number sectors' field IS sufficient for any compressed cluster starting at offset 0 in the cluster, we run into issues if the starting offset is later in the cluster.  That is, even though the compressed length of a 512-byte cluster is <= 512 (or we wouldn't compress it), if we start at offset 510, we NEED to read the next cluster to get the rest of the compressed stream - but at 512-byte clusters, there are 0 bits reserved for 'number sectors'.  Per the above calculations with the example offset of 510, nb_csectors would be 1 (it can't be anything else for a 512-byte cluster image), and csize would then be 2 bytes, which is insufficient for reading back enough data to reconstruct the cluster.
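To put numbers on that, here is a tiny stand-alone sketch of the read-side arithmetic (paraphrased from the nb_csectors/csize computation in block/qcow2-cluster.c, not copied verbatim), assuming the spec's reading that a 512-byte-cluster image has zero 'number sectors' bits, so the decoded count can only be 1:

    #include <stdio.h>

    int main(void)
    {
        int host_offset = 3070;                 /* compressed data starts 510 bytes into its sector */
        int nb_csectors = 1;                    /* forced to 1: zero field bits per the spec wording */
        int sector_offset = host_offset & 511;  /* 510 */
        int csize = nb_csectors * 512 - sector_offset;

        printf("sector_offset=%d csize=%d\n", sector_offset, csize);
        /* prints sector_offset=510 csize=2: only 2 bytes would be read back,
         * although the compressed stream spills into the next cluster */
        return 0;
    }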

In fact, here's a demonstration of a discrepancy, where qemu-img and John's qcheck tool [1] disagree about the validity of an image created by qemu (although it may just be that qcheck hasn't yet learned about compressed clusters):

[1] https://github.com/jnsnow/qcheck

$ f=12345678
$ f=$f$f$f$f # 32
$ f=$f$f$f$f # 128
$ f=$f$f$f$f # 512
$ f=$f$f$f$f # 2k
$ f=$f$f$f$f # 8k
$ f=$f$f$f$f # 32k
$ f=$f$f$f$f # 128k
$ printf "$f" > data
$ qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 \
  data data.qcow2
$ qemu-img check data.qcow2
No errors were found on the image.
256/256 = 100.00% allocated, 100.00% fragmented, 100.00% compressed clusters
Image end offset: 18944
$ ./qcheck data.qcow2
...
== L2 Tables ==

== L2 cluster l1[0] : 0x0000000000000800 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[1] : 0x0000000000000e00 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[2] : 0x0000000000001400 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[3] : 0x0000000000001a00 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
L2 tables: Could not complete analysis, 257 problems found


== Reference Count Analysis ==

Refcount analysis: 00 vacant clusters
Refcount analysis: 04 leaked clusters
Refcount analysis: 00 ghost clusters
Refcount analysis: 04 miscounted clusters
Refcount analysis: 8 problems found


== Cluster Counts ==

Metadata: 0x1000
Data: 0x800
Leaked: 0x800
Vacant: 0x0
total: 0x2000
qcheck: 73 problems found



Not true.  It is (cluster_bits - 9) or (cluster_size / 512).  Remember, x = 62 - (cluster_bits - 8); for a 512-byte cluster, x = 61.  The 'number sectors' field is then bits x+1 - 61 (but you can't have a bitfield occupying bit 62 up to 61; especially since bit 62 is the bit for compressed cluster).

So instead of blindly reading the spec, I'm now going to single-step through the 'qemu-img convert' command line above, to see what REALLY happens:

Line numbers from commit a6e0344fa0
$ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 data data.qcow2
...
(gdb) b qcow2_alloc_bytes
Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052.
(gdb) r
Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
    address@hidden, address@hidden)
    at block/qcow2-refcount.c:1052
1052    {
(gdb)

So we are compressing 512 bytes down to 15 every time, which means that after 34 clusters have been compressed (34 * 15 = 510), the next allocation should land at offset 510 within its cluster.  Let's resume debugging:

(gdb) c 34
Will ignore next 33 crossings of breakpoint 1.  Continuing.
[Thread 0x7fffe3cfe700 (LWP 32229) exited]
[New Thread 0x7fffe3cfe700 (LWP 32300)]
[New Thread 0x7fffe25ed700 (LWP 32301)]

Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
    address@hidden, address@hidden)
    at block/qcow2-refcount.c:1052
1052    {
(gdb) n
1053        BDRVQcow2State *s = bs->opaque;
(gdb)
1058        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_BYTES);
(gdb)
1059        assert(size > 0 && size <= s->cluster_size);
(gdb) p s->free_byte_offset
$2 = 3070
(gdb) p 3070%512
$3 = 510
...
(gdb)
1076        free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
(gdb)
1078            if (!offset || free_in_cluster < size) {
(gdb) p free_in_cluster
$4 = 2
1079                int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
(gdb)
1080                if (new_cluster < 0) {
(gdb)
1084                if (new_cluster == 0) {
(gdb)
1091                if (!offset || ROUND_UP(offset, s->cluster_size) != new_cluster) {
(gdb)
1095                    free_in_cluster += s->cluster_size;
(gdb)
1099            assert(offset);

So we got a contiguous cluster, and our goal is to let the caller bleed the compressed cluster into the tail of the current sector and into the head of the next cluster.  Continuing:

(gdb) fin
Run till exit from #0  qcow2_alloc_bytes (address@hidden,
    address@hidden) at block/qcow2-refcount.c:1118
[Thread 0x7fffe25ed700 (LWP 32301) exited]
[Thread 0x7fffe3cfe700 (LWP 32300) exited]
qcow2_alloc_compressed_cluster_offset (address@hidden,
    address@hidden, address@hidden)
    at block/qcow2-cluster.c:768
768         if (cluster_offset < 0) {
Value returned is $5 = 3070

(gdb) n
773         nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
(gdb)
774                       (cluster_offset >> 9);
(gdb)
773         nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
(gdb)
777                           ((uint64_t)nb_csectors << s->csize_shift);
(gdb) l
772     
773         nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
774                       (cluster_offset >> 9);
775     
776         cluster_offset |= QCOW_OFLAG_COMPRESSED |
777                           ((uint64_t)nb_csectors << s->csize_shift);
778     
779         /* update L2 table */
780     
781         /* compressed clusters never have the copied flag */
(gdb) p nb_csectors
$6 = 1
(gdb) p s->csize_shift
$7 = 61
(gdb) p/x cluster_offset
$8 = 0xbfe
(gdb) n
776         cluster_offset |= QCOW_OFLAG_COMPRESSED |
(gdb)
783         BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
(gdb) p/x cluster_offset
$9 = 0x6000000000000bfe

Where is s->csize_shift initialized?  In qcow2_do_open():

    s->csize_shift = (62 - (s->cluster_bits - 8));
    s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;
    s->cluster_offset_mask = (1LL << s->csize_shift) - 1;
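
Plugging cluster_bits = 9 (512-byte clusters) into those three lines reproduces the values gdb printed above ($7 and $9); a throwaway check, assuming nothing beyond the lines just quoted:

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        int cluster_bits = 9;                                     /* 512-byte clusters */
        int csize_shift = 62 - (cluster_bits - 8);                /* 61, matches $7 */
        uint64_t csize_mask = (1 << (cluster_bits - 8)) - 1;      /* 0x1: a single bit */
        uint64_t cluster_offset_mask = (1ULL << csize_shift) - 1; /* bits 0-60 */

        /* rebuild the L2 entry from the session: host offset 3070, nb_csectors 1 */
        uint64_t entry = 3070 | (1ULL << 62)                      /* compressed flag, bit 62 */
                              | ((uint64_t)1 << csize_shift);

        printf("csize_shift=%d csize_mask=0x%" PRIx64 " offset_mask=0x%" PRIx64 "\n",
               csize_shift, csize_mask, cluster_offset_mask);
        printf("entry=0x%" PRIx64 "\n", entry);                   /* 0x6000000000000bfe, matches $9 */
        return 0;
    }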

Revisiting the wording in the spec:

Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):

    Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a
                    cluster boundary!

       x+1 - 61:    Compressed size of the images in sectors of 512 bytes

which, for a 512-byte cluster image (x = 61), says bits 0-61 are the host cluster offset and bits 62-61 (that is, no bits at all) are the number of sectors.  But our code sets s->csize_shift to divide this differently, at bits 0-60 and 61-61.  Which means your earlier claim, that there are enough 'number sectors' bits to allow up to 2*cluster_size as the size of the compressed stream (rather than my claim of exactly cluster_size), is right; other implementations CAN inflate a cluster (if we don't document otherwise), and even if they DON'T inflate, they can STILL cause a read larger than a cluster size when the offset is near the tail of one sector (most likely at 512-byte clusters, but remotely possible at other cluster sizes as well).
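
For what that one-bit difference buys at each cluster size, here is a quick back-of-the-envelope comparison (only my arithmetic from the spec wording and the qcow2_do_open() lines quoted above, not anything QEMU prints):

    #include <stdio.h>

    int main(void)
    {
        for (int cluster_bits = 9; cluster_bits <= 16; cluster_bits++) {
            int cluster_size = 1 << cluster_bits;
            int spec_size_bits = cluster_bits - 9;  /* spec: size field is bits x+1 - 61 */
            int code_size_bits = cluster_bits - 8;  /* code: csize_mask = (1 << (cluster_bits - 8)) - 1 */
            /* the stored field is "sectors spanned minus one" (see the write-side
             * nb_csectors computation stepped through above), so with all field
             * bits set the stream may span this many bytes of 512-byte sectors
             * (the offset within the first sector further reduces what's usable): */
            printf("cluster_size=%6d  spec max %7d  code max %7d\n",
                   cluster_size,
                   (1 << spec_size_bits) * 512,
                   (1 << code_size_bits) * 512);
        }
        return 0;
    }

i.e. the split the code actually uses allows exactly the 2*cluster_size you described, at every cluster size, while the split in the spec's wording tops out at cluster_size.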

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


