qemu-devel

Re: Race condition in overlayed qcow2?


From: Vladimir Sementsov-Ogievskiy
Subject: Re: Race condition in overlayed qcow2?
Date: Wed, 19 Feb 2020 19:07:48 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1

19.02.2020 17:32, dovgaluk wrote:
Hi!

I encountered a problem with record/replay of QEMU execution and figured out
the following. QEMU is started with one virtual disk connected to a qcow2
image with the 'snapshot' option applied.

The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: introduce
parallel subrequest handling in read and write" introduces some kind of race
condition, which causes differences in the data read from the disk.

I detected this by adding the following code, which logs the checksum of each
IO operation. This checksum may differ between runs of the same recorded
execution.

Logging in the blk_aio_complete function:

    qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
    QEMUIOVector *qiov = acb->rwco.iobuf;
    if (qiov && qiov->iov) {
        size_t i, j;
        uint64_t sum = 0;
        int count = 0;
        /* byte-wise sum over all iovec buffers */
        for (i = 0; i < qiov->niov; ++i) {
            for (j = 0; j < qiov->iov[i].iov_len; ++j) {
                sum += ((uint8_t *)qiov->iov[i].iov_base)[j];
                ++count;
            }
        }
        qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n",
                 acb->rwco.offset, count, sum);
    }

I tried to get rid of the aio task by patching qcow2_co_preadv_part to call
the task function directly:

    ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes,
                               qiov, qiov_offset);

That change fixed the bug, but I have no idea what to debug next to figure
out the exact cause of the failure.

Do you have any ideas or hints?


Hi!

Hmm, do you mean that a read from the disk may return wrong data? That would
be very bad, of course :(
Could you provide a reproducer, so that I can look at it and debug?

What exactly is the case? Maybe you have other parallel aio operations on the
same region?

Ideas to experiment:

1. Change QCOW2_MAX_WORKERS to 1 or 2 - will that help?
2. Understand what the case is in the code: is the read from one cluster or
several, is it aligned, what is the type of the clusters, is encryption or
compression in use?
3. Understand what kind of data corruption it is. What do we read instead of
the correct data? Just garbage, or maybe zeroes, or what?

And of course, the best thing would be to create a small reproducer, or a
test in tests/qemu-iotests.


--
Best regards,
Vladimir
