Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [Qemu-devel] [PATCH] file-posix: Cache lseek result for data regions
Date: Fri, 25 Jan 2019 09:13:09 +0000

24.01.2019 19:36, Kevin Wolf wrote:
> Am 24.01.2019 um 17:18 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 24.01.2019 17:17, Kevin Wolf wrote:
>>> Depending on the exact image layout and the storage backend (tmpfs is
>>> known to have very slow SEEK_HOLE/SEEK_DATA), caching lseek results can
>>> save us a lot of time e.g. during a mirror block job or qemu-img convert
>>> with a fragmented source image (.bdrv_co_block_status on the protocol
>>> layer can be called for every single cluster in the extreme case).
>>>
>>> We may only cache data regions because of possible concurrent writers.
>>> This means that we can later treat a recently punched hole as data, but
>>> this is safe. We can't cache holes because then we might treat recently
>>> written data as holes, which can cause corruption.
>>>
>>> Signed-off-by: Kevin Wolf <address@hidden>
> 
>>> @@ -1555,8 +1561,17 @@ static int handle_aiocb_write_zeroes_unmap(void *opaque)
>>>    {
>>>        RawPosixAIOData *aiocb = opaque;
>>>        BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
>>> +    struct seek_data_cache *sdc;
>>>        int ret;
>>>    
>>> +    /* Invalidate seek_data_cache if it overlaps */
>>> +    sdc = &s->seek_data_cache;
>>> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
>>> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
>>
>> To be precise: <= and >=, since both ranges have an exclusive end offset.
> 
> Yes, you're right.
> 
>>> +    {
>>> +        sdc->valid = false;
>>> +    }
>>> +
>>>        /* First try to write zeros and unmap at the same time */
>>>    
>>
>>
>> Why not drop the cache in handle_aiocb_write_zeroes() as well? Otherwise
>> we'll return DATA for regions which may be unallocated and read as zero,
>> if I'm not mistaken.
> 
> handle_aiocb_write_zeroes() is not allowed to unmap things, so we don't
> need to invalidate the cache there.

So you mean that for fallocated regions we always return just _DATA,
without _ZERO?
If so, that is of course bad: it means that convert will have to copy (or at
least read and detect zeroes by hand, if enabled) the
write-zeroed-without-unmap areas.

Let's check. (I had to use qemu-img map inside qemu-io, patch attached for
it. I also added printf("%s\n", __func__) to handle_aiocb_write_zeroes_unmap
and handle_aiocb_write_zeroes.)

The test script:
]# cat test
./qemu-img create -f raw x 1M

./qemu-io -f raw x <<CMDS
write 0 1M
map
write -z 100K 100K
map
write -z -u 500K 100K
map
CMDS

rm -rf x

before your patch:
]# ./test
Formatting 'x', fmt=raw size=1048576
qemu-io> wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0523 sec (19.093 MiB/sec and 19.0927 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes
wrote 102400/102400 bytes at offset 102400
100 KiB, 1 ops; 0.0165 sec (5.898 MiB/sec and 60.3974 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 843776, "depth": 0, "zero": false, "data": true, "offset": 204800}]
qemu-io> handle_aiocb_write_zeroes_unmap
wrote 102400/102400 bytes at offset 512000
100 KiB, 1 ops; 0.0001 sec (545.566 MiB/sec and 5586.5922 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
{ "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
{ "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]



after your patch:
# ./test
Formatting 'x', fmt=raw size=1048576
qemu-io> wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0768 sec (13.019 MiB/sec and 13.0195 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes
wrote 102400/102400 bytes at offset 102400
100 KiB, 1 ops; 0.0166 sec (5.883 MiB/sec and 60.2410 ops/sec)
qemu-io> [{ "start": 0, "length": 1048576, "depth": 0, "zero": false, "data": true, "offset": 0}]
qemu-io> handle_aiocb_write_zeroes_unmap
wrote 102400/102400 bytes at offset 512000
100 KiB, 1 ops; 0.0002 sec (469.501 MiB/sec and 4807.6923 ops/sec)
qemu-io> [{ "start": 0, "length": 102400, "depth": 0, "zero": false, "data": true, "offset": 0},
{ "start": 102400, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 102400},
{ "start": 204800, "length": 307200, "depth": 0, "zero": false, "data": true, "offset": 204800},
{ "start": 512000, "length": 102400, "depth": 0, "zero": true, "data": false, "offset": 512000},
{ "start": 614400, "length": 434176, "depth": 0, "zero": false, "data": true, "offset": 614400}]


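So your patch changes the behavior of block_status after write_zeroes without
UNMAP for the worse: after the patch, the range zeroed at 100K is reported as
data (the second map shows one 1 MiB data extent), and it only shows up as
zero again once the later write with unmap has run.

Hmm, should I prepare a patch for qemu-io? The qemu-img map output is
definitely better.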
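
To make the regression above easier to reason about, here is a minimal
model of it in plain C. It is only a sketch, not the QEMU code: the struct
mirrors the fields used in the patch, while sdc_hit(), model_write_zeroes()
and stale_data_after_write_zeroes() are hypothetical names invented for this
illustration. It assumes the cache was filled by the first map over the
whole 1 MiB file, as in the test run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the cache in the patch; not the QEMU code. */
struct seek_data_cache {
    bool valid;
    int64_t start;   /* inclusive */
    int64_t end;     /* exclusive */
};

/* Mirrors the lookup added to raw_co_block_status(): a hit means the
 * offset is reported as plain DATA without calling lseek() at all. */
static bool sdc_hit(const struct seek_data_cache *sdc, int64_t offset)
{
    return sdc->valid && sdc->start <= offset && sdc->end > offset;
}

/* Hypothetical model of a write-zeroes request. With invalidate == false
 * the cache is left alone, as handle_aiocb_write_zeroes() does in the
 * patch; with invalidate == true an overlapping cache entry is dropped. */
static void model_write_zeroes(struct seek_data_cache *sdc, int64_t offset,
                               int64_t bytes, bool invalidate)
{
    assert(bytes > 0);
    if (invalidate && sdc->valid &&
        !(sdc->end <= offset || sdc->start >= offset + bytes)) {
        sdc->valid = false;
    }
}

/* Replays the test: cache the 1 MiB data extent, zero 100K..200K, then
 * report whether a status query at 100K is still answered from the cache. */
static bool stale_data_after_write_zeroes(bool invalidate)
{
    struct seek_data_cache sdc = { .valid = true, .start = 0, .end = 1048576 };

    model_write_zeroes(&sdc, 102400, 102400, invalidate);
    return sdc_hit(&sdc, 102400);
}
```

In this model, stale_data_after_write_zeroes(false) stays a cache hit, which
corresponds to the single 1 MiB data extent in the "after your patch" output,
while stale_data_after_write_zeroes(true) misses and would fall through to
the real lseek() path.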

> 
>>>    #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>> @@ -1634,11 +1649,20 @@ static int handle_aiocb_discard(void *opaque)
>>>        RawPosixAIOData *aiocb = opaque;
>>>        int ret = -EOPNOTSUPP;
>>>        BDRVRawState *s = aiocb->bs->opaque;
>>> +    struct seek_data_cache *sdc;
>>>    
>>>        if (!s->has_discard) {
>>>            return -ENOTSUP;
>>>        }
>>>    
>>> +    /* Invalidate seek_data_cache if it overlaps */
>>> +    sdc = &s->seek_data_cache;
>>> +    if (sdc->valid && !(sdc->end < aiocb->aio_offset ||
>>> +                        sdc->start > aiocb->aio_offset + aiocb->aio_nbytes))
>>
>> Same here: <= and >=.
>>
>> And if the same check is added to handle_aiocb_write_zeroes(), then it is
>> worth creating a helper function to invalidate the cache.
> 
> Ok.
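
A possible shape for such a helper, as a sketch only: the name
seek_data_cache_invalidate() and the ranges_overlap() helper are
hypothetical, and the struct is a simplified copy of the one in the patch.
It assumes both the cached range and the request are half-open ranges
[start, end), which is why the comparisons are <= and >=.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the cache in the patch; not the QEMU code. */
struct seek_data_cache {
    bool valid;
    int64_t start;   /* inclusive */
    int64_t end;     /* exclusive */
};

/* Two half-open ranges [a_start, a_end) and [b_start, b_end) are disjoint
 * exactly when one ends at or before the point where the other starts. */
static bool ranges_overlap(int64_t a_start, int64_t a_end,
                           int64_t b_start, int64_t b_end)
{
    return !(a_end <= b_start || a_start >= b_end);
}

/* Hypothetical helper: drop the cached data range if the request
 * [offset, offset + bytes) touches it. */
static void seek_data_cache_invalidate(struct seek_data_cache *sdc,
                                       int64_t offset, int64_t bytes)
{
    assert(bytes >= 0);
    if (sdc->valid &&
        ranges_overlap(sdc->start, sdc->end, offset, offset + bytes)) {
        sdc->valid = false;
    }
}
```

With a cached range of [100, 200), a request starting at 200 or ending at
100 leaves the cache valid, whereas the original < and > comparisons would
have dropped it for such merely adjacent requests. That is harmless for
correctness but throws away a still-valid cache entry.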
> 
>>> +    {
>>> +        sdc->valid = false;
>>> +    }
>>> +
>>>        if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>>>    #ifdef BLKDISCARD
>>>            do {
>>> @@ -2424,6 +2448,8 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>                                                int64_t *map,
>>>                                                BlockDriverState **file)
>>>    {
>>> +    BDRVRawState *s = bs->opaque;
>>> +    struct seek_data_cache *sdc;
>>>        off_t data = 0, hole = 0;
>>>        int ret;
>>>    
>>> @@ -2439,6 +2465,14 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>            return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>>>        }
>>>    
>>> +    sdc = &s->seek_data_cache;
>>> +    if (sdc->valid && sdc->start <= offset && sdc->end > offset) {
>>> +        *pnum = MIN(bytes, sdc->end - offset);
>>> +        *map = offset;
>>> +        *file = bs;
>>> +        return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>>> +    }
>>> +
>>>        ret = find_allocation(bs, offset, &data, &hole);
>>>        if (ret == -ENXIO) {
>>>            /* Trailing hole */
>>> @@ -2451,14 +2485,27 @@ static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
>>>        } else if (data == offset) {
>>>            /* On a data extent, compute bytes to the end of the extent,
>>>             * possibly including a partial sector at EOF. */
>>> -        *pnum = MIN(bytes, hole - offset);
>>> +        *pnum = hole - offset;
>>
>> Hmm, why? At least you didn't mention it in the commit message.
> 
> We want to cache the whole range returned by lseek(), not just whatever
> the raw_co_block_status() caller wanted to know.
> 
> For the returned value, *pnum is adjusted to MIN(bytes, *pnum) below...

Oops, that was a stupid question, sorry :(
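
For the record, the point is that the cached range and the returned length
are deliberately different. A minimal sketch of that split, with a
simplified copy of the struct and a hypothetical function name
report_data_extent(); it assumes offset sits on a data extent whose end
(the next hole) has already been found by lseek():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the cache in the patch; not the QEMU code. */
struct seek_data_cache {
    bool valid;
    int64_t start;   /* inclusive */
    int64_t end;     /* exclusive */
};

/* Hypothetical illustration: remember the whole data extent up to the
 * next hole, but report at most 'bytes' back to the caller. */
static int64_t report_data_extent(struct seek_data_cache *sdc,
                                  int64_t offset, int64_t hole,
                                  int64_t bytes)
{
    int64_t pnum;

    assert(hole > offset && bytes > 0);
    pnum = hole - offset;               /* full extent, not clamped */

    *sdc = (struct seek_data_cache) {
        .valid = true,
        .start = offset,
        .end   = offset + pnum,
    };

    return pnum < bytes ? pnum : bytes; /* clamp only the return value */
}
```

A caller asking about 64 KiB at offset 0 of a fully allocated 1 MiB file
would get 65536 back, while the cache would cover the full [0, 1048576)
range, so later queries inside that extent need no further lseek() call.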

> 
>>>            ret = BDRV_BLOCK_DATA;
>>>        } else {
>>>            /* On a hole, compute bytes to the beginning of the next extent. */
>>>            assert(hole == offset);
>>> -        *pnum = MIN(bytes, data - offset);
>>> +        *pnum = data - offset;
>>>            ret = BDRV_BLOCK_ZERO;
>>>        }
>>> +
>>> +    /* Caching allocated ranges is okay even if another process writes to the
>>> +     * same file because we allow declaring things allocated even if there is a
>>> +     * hole. However, we cannot cache holes without risking corruption. */
>>> +    if (ret == BDRV_BLOCK_DATA) {
>>> +        *sdc = (struct seek_data_cache) {
>>> +            .valid  = true,
>>> +            .start  = offset,
>>> +            .end    = offset + *pnum,
>>> +        };
>>> +    }
>>> +
>>> +    *pnum = MIN(*pnum, bytes);
> 
> ...here.
> 
> So what we return doesn't change.
> 
>>>        *map = offset;
>>>        *file = bs;
>>>        return ret | BDRV_BLOCK_OFFSET_VALID;
> 
> Kevin
> 


-- 
Best regards,
Vladimir

Attachment: 0001-my.patch
Description: 0001-my.patch

