[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection
From: |
Peter Lieven |
Subject: |
Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection |
Date: |
Sat, 4 Dec 2021 10:53:02 +0100 |
> Am 04.12.2021 um 00:04 schrieb Vladimir Sementsov-Ogievskiy
> <vsementsov@virtuozzo.com>:
>
> 03.12.2021 14:17, Peter Lieven wrote:
>>> Am 19.05.21 um 18:48 schrieb Kevin Wolf:
>>> Am 19.05.2021 um 15:24 hat Peter Lieven geschrieben:
>>>> Am 20.04.21 um 18:52 schrieb Vladimir Sementsov-Ogievskiy:
>>>>> 20.04.2021 18:04, Kevin Wolf wrote:
>>>>>> Am 20.04.2021 um 16:31 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>>> 15.04.2021 18:22, Kevin Wolf wrote:
>>>>>>>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed
>>>>>>>> areas
>>>>>>>> like non-zero data if the end of the checked area isn't aligned. This
>>>>>>>> can improve the efficiency of the conversion and was introduced in
>>>>>>>> commit 8dcd3c9b91a.
>>>>>>>>
>>>>>>>> However, it comes with a correctness problem: qemu-img convert is
>>>>>>>> supposed to sparsify areas that contain only zeros, which it doesn't do
>>>>>>>> any more. It turns out that this even happens when not only the
>>>>>>>> unaligned area is zeroed, but also the blocks before and after it. In
>>>>>>>> the bug report, conversion of a fragmented 10G image containing only
>>>>>>>> zeros resulted in an image consuming 2.82 GiB even though the expected
>>>>>>>> size is only 4 KiB.
>>>>>>>>
>>>>>>>> As a tradeoff between both, let's ignore zeroed sectors only after
>>>>>>>> non-zero data to fix the alignment, but if we're only looking at zeros,
>>>>>>>> keep them as such, even if it may mean additional RMW cycles.
>>>>>>>>
>>>>>>> Hmm.. If I understand correctly, we are going to do unaligned
>>>>>>> write-zero. And that helps.
>>>>>> This can happen (mostly raw images on block devices, I think?), but
>>>>>> usually it just means skipping the write because we know that the target
>>>>>> image is already zeroed.
>>>>>>
>>>>>> What it does mean is that if the next part is data, we'll have an
>>>>>> unaligned data write.
>>>>>>
>>>>>>> Doesn't that mean that alignment is wrongly detected?
>>>>>> The problem is that you can have bdrv_block_status_above() return the
>>>>>> same allocation status multiple times in a row, but *pnum can be
>>>>>> unaligned for the conversion.
>>>>>>
>>>>>> We only look at a single range returned by it when detecting the
>>>>>> alignment, so it could be that we have zero buffers for both 0-11 and
>>>>>> 12-16 and detect two misaligned ranges, when both together are a
>>>>>> perfectly aligned zeroed range.
>>>>>>
>>>>>> In theory we could try to do some lookahead and merge ranges where
>>>>>> possible, which should give us the perfect result, but it would make the
>>>>>> code considerably more complicated. (Whether we want to merge them
>>>>>> doesn't only depend on the block status, but possibly also on the
>>>>>> content of a DATA range.)
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>> Oh, I understand now the problem, thanks for explanation.
>>>>>
>>>>> Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors
>>>>> must not align it down, to be possibly "merged" with next chunk if it is
>>>>> zero too.
>>>>>
>>>>> But it's still good to align zeroes down, if data starts somewhere inside
>>>>> the buf, isn't it?
>>>>>
>>>>> what about something like this:
>>>>>
>>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>>> index babb5573ab..d1704584a0 100644
>>>>> --- a/qemu-img.c
>>>>> +++ b/qemu-img.c
>>>>> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t
>>>>> *buf, int n, int *pnum,
>>>>> }
>>>>> }
>>>>> + if (i == n) {
>>>>> + /*
>>>>> + * The whole buf is the same.
>>>>> + *
>>>>> + * if it's data, just return it. It's the old behavior.
>>>>> + *
>>>>> + * if it's zero, just return too. It will work good if target is
>>>>> alredy
>>>>> + * zeroed. And if next chunk is zero too we'll have no RMW and
>>>>> no reason
>>>>> + * to write data.
>>>>> + */
>>>>> + *pnum = i;
>>>>> + return !is_zero;
>>>>> + }
>>>>> +
>>>>> tail = (sector_num + i) & (alignment - 1);
>>>>> if (tail) {
>>>>> if (is_zero && i <= tail) {
>>>>> - /* treat unallocated areas which only consist
>>>>> - * of a small tail as allocated. */
>>>>> + /*
>>>>> + * For sure next sector after i is data, and it will rewrite
>>>>> this
>>>>> + * tail anyway due to RMW. So, let's just write data now.
>>>>> + */
>>>>> is_zero = false;
>>>>> }
>>>>> if (!is_zero) {
>>>>> - /* align up end offset of allocated areas. */
>>>>> + /* If possible, align up end offset of allocated areas. */
>>>>> i += alignment - tail;
>>>>> i = MIN(i, n);
>>>>> } else {
>>>>> - /* align down end offset of zero areas. */
>>>>> + /*
>>>>> + * For sure next sector after i is data, and it will rewrite
>>>>> this
>>>>> + * tail anyway due to RMW. Better is avoid RMW and write
>>>>> zeroes up
>>>>> + * to aligned bound.
>>>>> + */
>>>>> i -= tail;
>>>>> }
>>>>> }
>>>> I think we forgot to follow up on this. Has anyone tested this
>>>> suggestion?
>>>>
>>>> Otherwise, I would try to rerun the tests I did with the my old and
>>>> Kevins suggestion.
>>> I noticed earlier this week that these patches are still in my
>>> development branch, but didn't actually pick it up again yet. So feel
>>> free to try it out.
>> It seems this time I forgot to follow up. Is this topic still open?
>
> Most probably yes :) I now checked, that my proposed diff is still applicable
> to master and don't break compilation. So, if you have some test, you can
> check if it works better with the change.
Unfortunately the reproducer in the original ticket does not work for me. It
might also depend on the XFS version used.
What I can say is that your patch has no negative impact for my use cases.
If I find the time I will try the reproducer on CentOS7
Best,
Peter