qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[PATCH v5 5/6] qcow2: consider in-flight-writes when freeing clusters


From: Vladimir Sementsov-Ogievskiy
Subject: [PATCH v5 5/6] qcow2: consider in-flight-writes when freeing clusters
Date: Fri, 26 Mar 2021 23:00:44 +0300

We have a bug in qcow2: assume we've started data write into host
cluster A. s->lock is unlocked. During the write the refcount of
cluster A may become zero, cluster may be reallocated for other needs,
and our in-flight write become a use-after-free.

To fix the bug let's do the following. Or better, let's start from what
we have now:

Now we consider cluster "free", when its refcount is 0. When cluster
becomes "free" we also update s->free_cluster_index and optionally
discard it on bs->file level. These two operations are done in same
s->lock critical section where refcount becomes 0 (and this all is in
update_refcount()). Calling update_refcount() is wrong. It's ofcourse
done sometimes, as not everything is moved to coroutine for now..
Still, it's out of our topic.

Later, we can reallocate "free" cluster in alloc_clusters_noref() and
qcow2_alloc_clusters_at(), where is_cluster_free() is used.

OK, to correctly handle in-flight writes, let's modify a concept of
"free" cluster, so that cluster is "free" when its refcount is 0 and
there no inflight writes.

So, we discard the cluster at bs->file level, update
s->free_cluster_index and allow reallocation only when both refcount
and inflight-write-cnt becomes both zero. It may happen either in
update_refcount() or in update_inflight_write_cnt().

Consider update_refcount() first. Here we update refcounts metadata we
must be under s->lock. So, if we catch a situation when refcount
becomes 0 but there are in-flight writes we change a behavior so that
we don't update s->free_cluster_index, and do not discard the cluster.
This will be done by update_inflight_write_cnt() later. So, we save
needed information into Qcow2InFlightWriteCounter structure, so that
further update_inflight_write_cnt() do not need to load the refcount.

Now what about update_inflight_write_cnt()? Here things become more
interesting, because we can avoid s->lock. If
inflight-write-cnt + refcount is still positive, we don't have any
yield point, just update inflight write counter and we are done.
If inflight-write-cnt becomes 0 and refcount is already 0, we just need
to keep inflight-write-counter=1 during the discard operation (if it
is needed) and then drop the counter and update s->free_cluster_index.

Note, that at this point we only implement the whole infrastructure for
in-flight writes handling. Nobody now call qcow2_inflight_writes_inc()
or qcow2_inflight_writes_dec(). It's a deal of the following patch.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2-refcount.c | 70 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 64 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index eedc83ea4a..9a0d6570a5 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -805,6 +805,15 @@ typedef struct Qcow2InFlightWriteCounter {
      * 0 the entry is removed from s->inflight_writes_counters.
      */
     uint64_t inflight_writes_cnt;
+
+    /* For convenience, keep cluster_index here */
+    int64_t cluster_index;
+
+    /* Cluster refcount is known to be zero */
+    bool refcount_zero;
+
+    /* Cluster refcount was made zero with this discard-type */
+    enum qcow2_discard_type last_discard_type;
 } Qcow2InFlightWriteCounter;
 
 /* Find Qcow2InFlightWriteCounter corresponding to @cluster_index */
@@ -845,6 +854,7 @@ update_inflight_write_cnt(BlockDriverState *bs, int64_t 
offset, int64_t length,
         if (!decrease) {
             if (!infl) {
                 infl = g_new0(Qcow2InFlightWriteCounter, 1);
+                infl->cluster_index = cluster_index;
                 g_hash_table_insert(s->inflight_writes_counters,
                                     g_memdup(&cluster_index,
                                              sizeof(cluster_index)), infl);
@@ -857,10 +867,40 @@ update_inflight_write_cnt(BlockDriverState *bs, int64_t 
offset, int64_t length,
         assert(infl);
         assert(infl->inflight_writes_cnt >= 1);
 
-        infl->inflight_writes_cnt--;
+        if (infl->inflight_writes_cnt > 1) {
+            infl->inflight_writes_cnt--;
+            continue;
+        }
 
-        if (infl->inflight_writes_cnt == 0) {
+        if (!infl->refcount_zero) {
+            /*
+             * All in-flight writes are finished, but cluster is still in use,
+             * nothing to do now.
+             */
             g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+            continue;
+        }
+
+        /*
+         * OK. At this point there no more refcounts and no more in-flight
+         * writes. Cluster is almost free. But we probably want to do one more
+         * in-flight operation: discard. So we keep inflight_writes_cnt = 1
+         * during the discard.
+         */
+        if (s->discard_passthrough[infl->last_discard_type]) {
+            int64_t cluster_offset = cluster_index << s->cluster_bits;
+            if (s->cache_discards) {
+                update_refcount_discard(bs, cluster_offset, s->cluster_size);
+            } else {
+                /* Discard is optional, ignore the return value */
+                bdrv_pdiscard(bs->file, cluster_offset, s->cluster_size);
+            }
+        }
+
+        g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+
+        if (cluster_index < s->free_cluster_index) {
+            s->free_cluster_index = cluster_index;
         }
     }
 }
@@ -970,6 +1010,7 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
         s->set_refcount(refcount_block, block_index, refcount);
 
         if (refcount == 0) {
+            Qcow2InFlightWriteCounter *infl;
             void *table;
 
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
@@ -986,8 +1027,24 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
 
-            if (s->discard_passthrough[type]) {
-                update_refcount_discard(bs, cluster_offset, s->cluster_size);
+            infl = find_infl_wr(s, cluster_index);
+            if (infl) {
+                /*
+                 * Refcount becomes zero, but there are still in-flight writes
+                 * to the cluster. update_inflight_write_cnt() will take care
+                 * of discarding the cluster and updating 
s->free_cluster_index.
+                 */
+                infl->refcount_zero = true;
+                infl->last_discard_type = type;
+            } else {
+                /* Refcount is zero and no in-fligth writes. Cluster is free. 
*/
+                if (cluster_index < s->free_cluster_index) {
+                    s->free_cluster_index = cluster_index;
+                }
+                if (s->discard_passthrough[type]) {
+                    update_refcount_discard(bs, cluster_offset,
+                                            s->cluster_size);
+                }
             }
         }
     }
@@ -1049,7 +1106,7 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,
 
 
 /*
- * Cluster is free when its refcount is 0
+ * Cluster is free when its refcount is 0 and there is no in-flight writes
  *
  * Return < 0 if failed to get refcount
  *          0 if cluster is not free
@@ -1057,6 +1114,7 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,
  */
 static int is_cluster_free(BlockDriverState *bs, int64_t cluster_index)
 {
+    BDRVQcow2State *s = bs->opaque;
     int ret;
     uint64_t refcount;
 
@@ -1065,7 +1123,7 @@ static int is_cluster_free(BlockDriverState *bs, int64_t 
cluster_index)
         return ret;
     }
 
-    return refcount == 0;
+    return refcount == 0 && !find_infl_wr(s, cluster_index);
 }
 
 /* return < 0 if error */
-- 
2.29.2




reply via email to

[Prev in Thread] Current Thread [Next in Thread]