qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 2/2] qemu-img: align next status sector on destination alignm


From: Maxim Levitsky
Subject: Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.
Date: Thu, 12 Nov 2020 17:04:24 +0200
User-agent: Evolution 3.36.3 (3.36.3-1.fc32)

On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> On 11/12/20 6:40 AM, Peter Lieven wrote:
> 
> > >          /*
> > > -         * Avoid that s->sector_next_status becomes unaligned to the 
> > > source
> > > -         * request alignment and/or cluster size to avoid unnecessary 
> > > read
> > > -         * cycles.
> > > +         * Avoid that s->sector_next_status becomes unaligned to the
> > > +         * source/destination request alignment and/or cluster size to 
> > > avoid
> > > +         * unnecessary read/write cycles.
> > >           */
> > > -        tail = (sector_num - src_cur_offset + n) % 
> > > s->src_alignment[src_cur];
> > > +        alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > +        assert(is_power_of_2(alignment));
> > > +
> > > +        tail = (sector_num - src_cur_offset + n) % alignment;
> > >          if (n > tail) {
> > >              n -= tail;
> > >          }
> > 
> > I was also considering including the s->alignment when adding this chance. 
> > However, you need the least common multiple of both alignments, not the 
> > maximum, otherwise
> > 
> > you might get misaligned to either source or destination.
> > 
> > 
> > Why exactly do you need the power of two requirement?
> 
> The power of two requirement ensures that you h ave the least common
> multiple of both alignments ;)
-
Yes, and in addition to that both alignments are already power of two because 
we align them
to it. The assert I added is just in case.

This is how we calculate the destination alignment:
 
s.alignment = MAX(pow2floor(s.min_sparse),
                      DIV_ROUND_UP(out_bs->bl.request_alignment,
                                   BDRV_SECTOR_SIZE));
 
 
This is how we calculate the source alignments (it is possible to have several 
source files)
 
s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
                                             BDRV_SECTOR_SIZE);
if (!bdrv_get_info(src_bs, &bdi)) {
    s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i], bdi.cluster_size / 
BDRV_SECTOR_SIZE);
}


The bl.request_alignment comment mentions that it must be power of 2,
and cluster sizes are also very likely to be power of 2 in all drivers
we support. An assert for s.src_alignment won't hurt though.

 
Note though that we don't really read the discard alignment.
We have max_pdiscard, and pdiscard_alignment, but max_pdiscard
is only used to split 'too large' discard requests, and pdiscard_alignment
is advisory and used only in a couple of places.
Neither are set by file-posix.
 
We do have request_alignment, which file-posix tries to align to the logical 
block
size which is still often 512 for backward compatibility on many devices (even 
nvme)
 
Now both source and destination alignments in qemu-img convert are based on 
request_aligment
and on min_sparse (which is by default 4K, controlled by qemu-img -S parameter
(which otherwise controls the minimum number of blocks to be zero, to consider
discarding them in convert)
 

This means that there is no guarantee to have 4K alignment, and besides,
with sufficiently fragmented source file, the bdrv_block_status_above
can return arbitrary short and unaligned extents which can't be aligned, 
thus as I said this patch alone can't guarantee that we won't have 
write and discard sharing the same page.
 
Another thing that can be discussed is why is_allocated_sectors was patched
to convert short discards to writes. Probably because underlying hardware
ignores them or something? In theory we should detect this and fail
back to regular zeroing in this case.
Again though, while in case of converting an empty file, this is the only
source of writes, and eliminating it, also 'fixes' this case, with sufficiently
fragmented source file we can even without this code get a write and discard
sharing a page.


Best regards,
        Maxim Levitsky

> 
> However, there ARE devices in practice that have advertised a
> non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> 63188c24).  Which means you probably don't want to assert power-of-2,
> and in turn need to worry about least common multiple.
> 





reply via email to

[Prev in Thread] Current Thread [Next in Thread]