On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
09.02.2021 16:25, Max Reitz wrote:
On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
I know, I have several series waiting for a resend, but I had to
to another task spawned from our customer's bug.
Original problem: we use O_DIRECT for all vm images in our
the policy. The only exclusion is backup target qcow2 image for
compressed backup, because compressed backup is extremely slow with
O_DIRECT (due to unaligned writes). Customer complains that backup
produces a lot of pagecache.
So we can either implement some internal cache or use fadvise
Backup has several async workes, which writes simultaneously, so
ways we have to track host cluster filling (before dropping the
corresponding to the cluster). So, if we have to track anyway,
try to implement the cache.
I wanted to be excited here, because that sounds like it would be
very easy to implement caching. Like, just keep the cluster at
free_byte_offset cached until the cluster it points to changes,
then flush the cluster.
The problem is that chunks are written asynchronously.. That's why
this all is not so easy.
But then I see like 900 new lines of code, and I’m much less
Idea is simple: cache small unaligned write and flush the cluster
Performance result is very good (results in a table is time of
compressed backup of 1000M disk filled with ones in seconds):
“Filled with ones” really is an edge case, though.
Yes, I think, all clusters are compressed to rather small chunks :)
--------------- ----------- -----------
ssd:hdd(direct) 3e+02 4.4
ssd:hdd(cached) 5.7 5.4
--------------- ----------- -----------
So, we have benefit even for cached mode! And the fastest thing is
O_DIRECT with new implemented cache. So, I suggest to enable the
cache by default (which is done by the series).
First, I’m not sure how O_DIRECT really is relevant, because I
don’t really see the point for writing compressed images.
compressed backup is a point
(Perhaps irrelevant, but just to be clear:) I meant the point of
using O_DIRECT, which one can decide to not use for backup targets
(as you have done already).
Second, I find it a bit cheating if you say there is a huge
improvement for the no-cache case, when actually, well, you just
added a cache. So the no-cache case just became faster because
there is a cache now.
Still, performance comparison is relevant to show that O_DIRECT as
is unusable for compressed backup.
(Again, perhaps irrelevant, but:) Yes, but my first point was
exactly whether O_DIRECT is even relevant for writing compressed
Well, I suppose I could follow that if O_DIRECT doesn’t make much
sense for compressed images, qemu’s format drivers are free to
introduce some caching (because technically the cache.direct
option only applies to the protocol driver) for collecting
Yes I thought in this way, enabling the cache by default.
That conclusion makes both of my complaints kind of moot.
Third, what is the real-world impact on the page cache? You
described that that’s the reason why you need the cache in qemu,
because otherwise the page cache is polluted too much. How much
is the difference really? (I don’t know how good the compression
ratio is for real-world images.)
Hm. I don't know the ratio.. Customer reported that most of RAM is
polluted by Qemu's cache, and we use O_DIRECT for everything except
for target of compressed backup.. Still the pollution may relate to
several backups and of course it is simple enough to drop the cache
after each backup. But I think that even one backup of 16T disk may
pollute RAM enough.
Oh, sorry, I just realized I had a brain fart there. I was
referring to whether this series improves the page cache pollution.
But obviously it will if it allows you to re-enable O_DIRECT.
Related to that, I remember a long time ago we had some discussion
about letting qemu-img convert set a special cache mode for the
target image that would make Linux drop everything before the last
offset written (i.e., I suppose fadvise() with
POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact
that implementing a cache in qemu would be simple, but it isn’t,
really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
advantage of using that would be that we could reuse it for
non-compressed images that are written by backup or qemu-img
The problem is that writes are async. And therefore, not sequential.
In theory, yes, but all compressed writes still goes through
qcow2_alloc_bytes() right before submitting the write, so I wonder
whether in practice the writes aren’t usually sufficiently
sequential to make POSIX_FADV_SEQUENTIAL work fine.
I have to track the writes and wait until the whole cluster is
filled. It's simple use fadvise as an option to my cache: instead
of caching data and write when cluster is filled we can instead
mark cluster POSIX_FADV_DONTNEED.
(I don’t remember why that qemu-img discussion died back then.)
Fourth, regarding the code, would it be simpler if it were a pure
write cache? I.e., on read, everything is flushed, so we don’t
have to deal with that. I don’t think there are many valid cases
where a compressed image is both written to and read from at the
same time. (Just asking, because I’d really want this code to be
simpler. I can imagine that reading from the cache is the least
bit of complexity, but perhaps...)
Hm. I really didn't want to support reads, and do it only to make
it possible to enable the cache by default.. Still read function is
really simple, and I don't think that dropping it will simplify the
That’s too bad.
So the only question I have left is what POSIX_FADV_SEQUENTIAL
actually would do in practice.