qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 2/2] qemu-img: align next status sector on destination alignm


From: Maxim Levitsky
Subject: Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
Date: Thu, 18 Mar 2021 11:57:07 +0200
User-agent: Evolution 3.36.5 (3.36.5-2.fc32)

On Thu, 2020-11-12 at 17:04 +0200, Maxim Levitsky wrote:
> On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> > On 11/12/20 6:40 AM, Peter Lieven wrote:
> > 
> > > >          /*
> > > > -         * Avoid that s->sector_next_status becomes unaligned to the 
> > > > source
> > > > -         * request alignment and/or cluster size to avoid unnecessary 
> > > > read
> > > > -         * cycles.
> > > > +         * Avoid that s->sector_next_status becomes unaligned to the
> > > > +         * source/destination request alignment and/or cluster size to 
> > > > avoid
> > > > +         * unnecessary read/write cycles.
> > > >           */
> > > > -        tail = (sector_num - src_cur_offset + n) % 
> > > > s->src_alignment[src_cur];
> > > > +        alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > > +        assert(is_power_of_2(alignment));
> > > > +
> > > > +        tail = (sector_num - src_cur_offset + n) % alignment;
> > > >          if (n > tail) {
> > > >              n -= tail;
> > > >          }
> > > 
> > > I was also considering including the s->alignment when adding this 
> > > chance. However, you need the least common multiple of both alignments, 
> > > not the maximum, otherwise
> > > 
> > > you might get misaligned to either source or destination.
> > > 
> > > 
> > > Why exactly do you need the power of two requirement?
> > 
> > The power of two requirement ensures that you h ave the least common
> > multiple of both alignments ;)
> -
> Yes, and in addition to that both alignments are already power of two because 
> we align them
> to it. The assert I added is just in case.
> 
> This is how we calculate the destination alignment:
>  
> s.alignment = MAX(pow2floor(s.min_sparse),
>                       DIV_ROUND_UP(out_bs->bl.request_alignment,
>                                    BDRV_SECTOR_SIZE));
>  
>  
> This is how we calculate the source alignments (it is possible to have 
> several source files)
>  
> s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
>                                              BDRV_SECTOR_SIZE);
> if (!bdrv_get_info(src_bs, &bdi)) {
>     s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i], bdi.cluster_size / 
> BDRV_SECTOR_SIZE);
> }
> 
> 
> The bl.request_alignment comment mentions that it must be power of 2,
> and cluster sizes are also very likely to be power of 2 in all drivers
> we support. An assert for s.src_alignment won't hurt though.
> 
>  
> Note though that we don't really read the discard alignment.
> We have max_pdiscard, and pdiscard_alignment, but max_pdiscard
> is only used to split 'too large' discard requests, and pdiscard_alignment
> is advisory and used only in a couple of places.
> Neither are set by file-posix.
>  
> We do have request_alignment, which file-posix tries to align to the logical 
> block
> size which is still often 512 for backward compatibility on many devices 
> (even nvme)
>  
> Now both source and destination alignments in qemu-img convert are based on 
> request_aligment
> and on min_sparse (which is by default 4K, controlled by qemu-img -S parameter
> (which otherwise controls the minimum number of blocks to be zero, to consider
> discarding them in convert)
>  
> 
> This means that there is no guarantee to have 4K alignment, and besides,
> with sufficiently fragmented source file, the bdrv_block_status_above
> can return arbitrary short and unaligned extents which can't be aligned, 
> thus as I said this patch alone can't guarantee that we won't have 
> write and discard sharing the same page.
>  
> Another thing that can be discussed is why is_allocated_sectors was patched
> to convert short discards to writes. Probably because underlying hardware
> ignores them or something? In theory we should detect this and fail
> back to regular zeroing in this case.
> Again though, while in case of converting an empty file, this is the only
> source of writes, and eliminating it, also 'fixes' this case, with 
> sufficiently
> fragmented source file we can even without this code get a write and discard
> sharing a page.
> 
> 
> Best regards,
>       Maxim Levitsky
> 
> > However, there ARE devices in practice that have advertised a
> > non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> > 63188c24).  Which means you probably don't want to assert power-of-2,
> > and in turn need to worry about least common multiple.
> > 

Any update on this patch?

Best regards,
        Maxim Levitsky




reply via email to

[Prev in Thread] Current Thread [Next in Thread]