qemu-devel

Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster


From: Dave Chinner
Subject: Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster
Date: Mon, 24 Aug 2020 07:59:07 +1000

On Fri, Aug 21, 2020 at 02:12:32PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@redhat.com> wrote:
> >>> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >>> >         with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >>> > 
> >>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >>> >         of the cluster with zeroes.
> >>> > 
> >>> > 3) metadata: all clusters were allocated when the image was created
> >>> >         but they are sparse, QEMU only writes the 4KB of data.
> >>> > 
> >>> > 4) falloc: all clusters were allocated with fallocate() when the image
> >>> >         was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > 5) full: all clusters were allocated by writing zeroes to all of them
> >>> >         when the image was created, QEMU only writes 4KB of data.
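To make (1) and (2) concrete, here is a minimal C sketch of the two
first-write paths; the helper names and the hard-coded 64KiB cluster
size are illustrative, this is not QEMU's actual code:

    /* Sketch: two ways to populate a freshly allocated 64KiB cluster.
     * Assumes fd is open for writing. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <unistd.h>

    #define CLUSTER (64 * 1024)

    /* Mode (1): zero the whole cluster with fallocate(), then write
     * the 4KiB of data into it. */
    static int write_mode1(int fd, off_t cluster, const char *buf)
    {
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, cluster, CLUSTER) < 0)
            return -1;
        return pwrite(fd, buf, 4096, cluster) == 4096 ? 0 : -1;
    }

    /* Mode (2): write the 4KiB of data, then fill the rest of the
     * cluster with explicit zeroes. */
    static int write_mode2(int fd, off_t cluster, const char *buf)
    {
        static char zeroes[CLUSTER - 4096];      /* zero-initialised */
        if (pwrite(fd, buf, 4096, cluster) != 4096)
            return -1;
        return pwrite(fd, zeroes, sizeof(zeroes), cluster + 4096)
                   == (ssize_t)sizeof(zeroes) ? 0 : -1;
    }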
> >>> > 
> >>> > As I said in a previous message I'm not familiar with xfs, but the
> >>> > parts that I don't understand are
> >>> > 
> >>> >    - Why is (4) slower than (1)?
> >>> 
> >>> Because fallocate() is a full IO serialisation barrier at the
> >>> filesystem level. If you do:
> >>> 
> >>> fallocate(whole file)
> >>> <IO>
> >>> <IO>
> >>> <IO>
> >>> .....
> >>> 
> >>> The IO can run concurrently and does not serialise against anything in
> >>> the filesystem except unwritten extent conversions at IO completion
> >>> (see answer to next question!)
> >>> 
> >>> However, if you just use (4) you get:
> >>> 
> >>> falloc(64k)
> >>>   <wait for inflight IO to complete>
> >>>   <allocates 64k as unwritten>
> >>> <4k io>
> >>>   ....
> >>> falloc(64k)
> >>>   <wait for inflight IO to complete>
> >>>   ....
> >>>   <4k IO completes, converts 4k to written>
> >>>   <allocates 64k as unwritten>
> >>> <4k io>
> >>> falloc(64k)
> >>>   <wait for inflight IO to complete>
> >>>   ....
> >>>   <4k IO completes, converts 4k to written>
> >>>   <allocates 64k as unwritten>
> >>> <4k io>
> >>>   ....
> >>> 
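Side by side, the two call patterns look roughly like this (a C
sketch with hypothetical helper names, error handling elided; the
barrier behaviour is as described in the timeline above, on XFS):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Pattern A: preallocate the whole file once up front. Subsequent
     * 4KiB writes can then run concurrently, serialising only on
     * unwritten extent conversion at IO completion. */
    static void prealloc_whole_file(int fd, off_t file_size)
    {
        fallocate(fd, 0, 0, file_size);
    }

    /* Pattern B: fallocate() each 64KiB cluster right before writing
     * into it. Each call waits for all in-flight IO on the file to
     * drain before allocating, stalling the IO pipeline. */
    static void alloc_then_write(int fd, off_t cluster, const char *buf)
    {
        fallocate(fd, 0, cluster, 64 * 1024);  /* IO barrier */
        pwrite(fd, buf, 4096, cluster);        /* then the 4KiB of data */
    }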
> >>
> >> Option 4 is described above as initial file preallocation whereas
> >> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> >> is reporting that the initial file preallocation mode is slower than
> >> the per cluster prealloc mode. Berto, am I following that right?
> 
> After looking more closely at the data I can see that there is a peak of
> ~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
> ~7K for the rest of the test.

How big is the filesystem, how big is the log? (xfs_info output,
please!)
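(On most systems that is just:

    xfs_info /mnt/yourmount

run against the mounted filesystem; the mount point here is
illustrative.)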

In general, there are three common causes of this. The first is the
initial burst of allocations running on an empty journal, with
allocation transactions then getting throttled back to the speed at
which metadata can be flushed once the journal fills up. If you have
a small filesystem and a default sized log, this is quite likely to
happen.
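If that turns out to be the case, a bigger log can be requested at
mkfs time to take journal size out of the picture when retesting,
e.g. (illustrative invocation, see mkfs.xfs(8) for the limits):

    mkfs.xfs -l size=512m /dev/vdb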

The second is that you have a large log and are running on hardware
where device cache flushes and FUA writes hammer overall device
performance. Hence when the CIL initially fills up and starts
flushing (journal writes are pre-flush + FUA, so they do both),
device performance goes way down because now it has to write its
cached data to physical media rather than just cache it in volatile
device RAM. IOWs, journal writes end up forcing all volatile data to
stable media, and that can slow the device down. Also, cache flushes
might not be queued commands, hence journal writes will also create
IO pipeline stalls...
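Whether a volatile write cache (and hence flushes/FUA) is in play at
all can usually be checked via sysfs, e.g. (device name illustrative):

    cat /sys/block/sda/queue/write_cache

where "write back" means cache flushes do real work on that device.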

The third is hardware capability. Consumer hardware is designed for
extremely fast bursty behaviour, but steady state performance is
much lower (think "SLC" burst caches in TLC SSDs). I have some
consumer SSDs here that can sustain 400MB/s of random 4kB writes for
about 10-15s, then drop to about 50MB/s once the burst buffer is
full. OTOH, I have enterprise SSDs that will indefinitely sustain a
_much_ higher rate of random 4kB writes than the consumer SSDs
manage even during their burst. However, most consumer workloads
don't move this sort of data around, so this sort of design tradeoff
is fine for that market (Benchmarketing 101 stuff :).
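One way to look past the burst buffer is to run the same random
write load well beyond it and watch for the drop, e.g. an
illustrative fio invocation (paths and sizes are placeholders):

    fio --name=sustain --directory=/mnt/scratch --size=16g --direct=1 \
        --ioengine=libaio --iodepth=32 --rw=randwrite --bs=4k \
        --time_based --runtime=300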

IOWs, this behaviour could be filesystem config, it could be cache
flush behaviour, or it could simply be storage device design
capability. Or it could be a combination of all three. Watching a
set of fast-sampling metrics that tell you what the device and
filesystem are doing in real time (e.g. I use PCP for this and
visualise the behaviour in real time via pmchart) gives a lot of
insight into exactly what is changing during transient workload
changes like starting a benchmark...
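For example, something as simple as (metric names assume the XFS
PMDA is installed, and may vary by PCP version):

    pmrep -t 1 disk.dev.write_bytes xfs.log.writes

sampled once a second makes the transition from the initial burst to
steady state fairly obvious.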

> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> of data in order to let performance settle, but if I remove that I can
> see the effect more clearly. I can observe it with raw files (in 'off'
> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> preallocation=off the performance is stable during the whole test.

What does "preallocation=off" mean again? Is that using
fallocate(ZERO_RANGE) prior to the data write rather than
preallocating the metadata/entire file? If so, I would expect the
limiting factor is the rate at which IO can be issued because of the
fallocate() triggered pipeline bubbles. That leaves idle device time
so you're not pushing the limits of the hardware and hence none of
the behaviours above will be evident...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


