qemu-block
Re: [PATCH 0/7] qcow2: compressed write cache


From: Denis V. Lunev
Subject: Re: [PATCH 0/7] qcow2: compressed write cache
Date: Tue, 9 Feb 2021 21:41:00 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.1

On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 17:47, Max Reitz wrote:
>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Hi all!
>>>>>>
>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>> switch
>>>>>> to another task spawned from our customer's bug.
>>>>>>
>>>>>> Original problem: we use O_DIRECT for all VM images in our
>>>>>> product; it's
>>>>>> the policy. The only exception is the backup target qcow2 image for
>>>>>> compressed backup, because compressed backup is extremely slow with
>>>>>> O_DIRECT (due to unaligned writes). The customer complains that backup
>>>>>> produces a lot of pagecache.
>>>>>>
>>>>>> So we can either implement some internal cache or use fadvise
>>>>>> somehow.
>>>>>> Backup has several async workers, which write simultaneously, so
>>>>>> either
>>>>>> way we have to track host cluster filling (before dropping the
>>>>>> cache
>>>>>> corresponding to the cluster).  So, if we have to track anyway,
>>>>>> let's
>>>>>> try to implement the cache.
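
[To illustrate what is being proposed here, a per-cluster write cache could
be sketched roughly like the following. This is purely illustrative and not
the code from the series: the fixed 64k cluster size, the single in-flight
cluster and all the names are invented.]

/* Illustrative sketch of a per-cluster write cache for compressed chunks:
 * unaligned writes are accumulated in a cluster-sized buffer and flushed
 * with one aligned pwrite() when the host cluster is full.  All names and
 * sizes are invented for illustration. */
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CLUSTER_SIZE (64 * 1024)

typedef struct WriteCache {
    uint64_t cluster_offset;   /* host offset of the cluster being filled */
    uint32_t bytes_filled;     /* how much of the cluster is buffered */
    uint8_t *buf;              /* CLUSTER_SIZE bytes, aligned for O_DIRECT */
} WriteCache;

int cache_init(WriteCache *c, uint64_t start_offset)
{
    c->cluster_offset = start_offset;
    c->bytes_filled = 0;
    /* O_DIRECT wants an aligned buffer */
    return posix_memalign((void **)&c->buf, 4096, CLUSTER_SIZE);
}

/* Buffer one compressed chunk; flush the whole cluster when it is full. */
int cache_write(int fd, WriteCache *c, const void *data, uint32_t len)
{
    /* chunks crossing a cluster boundary are not handled in this sketch */
    assert(c->bytes_filled + len <= CLUSTER_SIZE);

    memcpy(c->buf + c->bytes_filled, data, len);
    c->bytes_filled += len;

    if (c->bytes_filled == CLUSTER_SIZE) {
        if (pwrite(fd, c->buf, CLUSTER_SIZE, c->cluster_offset) < 0) {
            return -1;
        }
        c->cluster_offset += CLUSTER_SIZE;
        c->bytes_filled = 0;
    }
    return 0;
}

[The real problem is harder than this sketch: as discussed below, the chunks
arrive asynchronously and may cross cluster boundaries.]
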
>>>>>
>>>>> I wanted to be excited here, because that sounds like it would be
>>>>> very easy to implement caching.  Like, just keep the cluster at
>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>> then flush the cluster.
>>>>
>>>> The problem is that chunks are written asynchronously.. That's why
>>>> this all is not so easy.
>>>>
>>>>>
>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>> excited...
>>>>>
>>>>>> The idea is simple: cache small unaligned writes and flush the
>>>>>> cluster when
>>>>>> it is filled.
>>>>>>
>>>>>> The performance result is very good (the numbers in the table are the
>>>>>> time of a compressed backup of a 1000M disk filled with ones, in seconds):
>>>>>
>>>>> “Filled with ones” really is an edge case, though.
>>>>
>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>
>>>>>
>>>>>> ---------------  -----------  -----------
>>>>>>                   backup(old)  backup(new)
>>>>>> ssd:hdd(direct)  3e+02        4.4
>>>>>>                                  -99%
>>>>>> ssd:hdd(cached)  5.7          5.4
>>>>>>                                  -5%
>>>>>> ---------------  -----------  -----------
>>>>>>
>>>>>> So we have a benefit even for cached mode! And the fastest thing is
>>>>>> O_DIRECT with the newly implemented cache. So I suggest enabling the new
>>>>>> cache by default (which is done by the series).
>>>>>
>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>> don’t really see the point for writing compressed images.
>>>>
>>>> compressed backup is a point
>>>
>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>> using O_DIRECT, which one can decide to not use for backup targets
>>> (as you have done already).
>>>
>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>> improvement for the no-cache case, when actually, well, you just
>>>>> added a cache.  So the no-cache case just became faster because
>>>>> there is a cache now.
>>>>
>>>> Still, the performance comparison is relevant to show that O_DIRECT
>>>> as-is is unusable for compressed backup.
>>>
>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>> exactly whether O_DIRECT is even relevant for writing compressed
>>> images.
>>>
>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>> introduce some caching (because technically the cache.direct
>>>>> option only applies to the protocol driver) for collecting
>>>>> compressed writes.
>>>>
>>>> Yes, I was thinking along those lines: enabling the cache by default.
>>>>
>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>
>>>>> *shrug*
>>>>>
>>>>> Third, what is the real-world impact on the page cache?  You
>>>>> described that that’s the reason why you need the cache in qemu,
>>>>> because otherwise the page cache is polluted too much.  How much
>>>>> is the difference really?  (I don’t know how good the compression
>>>>> ratio is for real-world images.)
>>>>
>>>> Hm. I don't know the ratio.. The customer reported that most of RAM is
>>>> polluted by QEMU's cache, and we use O_DIRECT for everything except
>>>> for the target of compressed backup.. Still, the pollution may relate to
>>>> several backups, and of course it is simple enough to drop the cache
>>>> after each backup. But I think that even one backup of a 16T disk may
>>>> pollute RAM enough.
>>>
>>> Oh, sorry, I just realized I had a brain fart there.  I was
>>> referring to whether this series improves the page cache pollution. 
>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>
>>>>> Related to that, I remember a long time ago we had some discussion
>>>>> about letting qemu-img convert set a special cache mode for the
>>>>> target image that would make Linux drop everything before the last
>>>>> offset written (i.e., I suppose fadvise() with
>>>>> POSIX_FADV_SEQUENTIAL).  You discarded that idea based on the fact
>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>> really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
>>>>> advantage of using that would be that we could reuse it for
>>>>> non-compressed images that are written by backup or qemu-img
>>>>> convert.)
>>>>
>>>> The problem is that writes are async. And therefore, not sequential.
>>>
>>> In theory, yes, but all compressed writes still go through
>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>> whether in practice the writes aren’t usually sufficiently
>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>
>> Yes, allocation is sequential. But writes are not.. Reasonable, I
>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>> the whole backup target before the backup? Will try. Still, I expect
>> that my cache will show better performance anyway. Actually,
>> comparing cached (by pagecache) vs my cache, we have 5.7 -> 4.4, i.e.
>> 20% faster, which is significant (still, yes, it would be good to check
>> it on a more realistic case than all-ones).
>>
>>>
>>>> So
>>>> I have to track the writes and wait until the whole cluster is
>>>> filled. It's simple to use fadvise as an option to my cache: instead
>>>> of caching data and writing when the cluster is filled, we can instead
>>>> mark the cluster POSIX_FADV_DONTNEED.
>>>>
>>>>>
>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>
>>>>>
>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>> write cache?  I.e., on read, everything is flushed, so we don’t
>>>>> have to deal with that.  I don’t think there are many valid cases
>>>>> where a compressed image is both written to and read from at the
>>>>> same time. (Just asking, because I’d really want this code to be
>>>>> simpler.  I can imagine that reading from the cache is the least
>>>>> bit of complexity, but perhaps...)
>>>>>
>>>>
>>>> Hm. I really didn't want to support reads, and did it only to make
>>>> it possible to enable the cache by default.. Still, the read function is
>>>> really simple, and I don't think that dropping it would simplify the
>>>> code significantly.
>>>
>>> That’s too bad.
>>>
>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>> actually would do in practice.
>>
>> will check.
>>
>
> I checked: if I mark the whole file with FADV_SEQUENTIAL, the cache is not
> dropped.
>
> Test:
> [root@kvm fadvise]# cat a.c
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <getopt.h>
> #include <string.h>
> #include <stdbool.h>
>
> int main(int argc, char *argv[])
> {
>     int fd;
>     int i;
>     char mb[1024 * 1024];
>     int open_flags = O_RDWR | O_CREAT | O_EXCL;
>     int fadv_flags = 0;
>     int fadv_final_flags = 0;
>     int len = 1024 * 1024;
>     bool do_fsync = false;
>
>     for (i = 1; i < argc - 1; i++) {
>         const char *arg = argv[i];
>
>         if (!strcmp(arg, "direct")) {
>             open_flags |= O_DIRECT;
>         } else if (!strcmp(arg, "seq")) {
>             fadv_flags = POSIX_FADV_SEQUENTIAL;
>         } else if (!strcmp(arg, "dontneed")) {
>             fadv_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "final-dontneed")) {
>             fadv_final_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "fsync")) {
>             do_fsync = true;
>         } else {
>             fprintf(stderr, "unknown: %s\n", arg);
>             return 1;
>         }
>     }
>
>     fd = open(argv[argc - 1], open_flags, 0644); /* mode needed with O_CREAT */
>
>     if (fd < 0) {
>         fprintf(stderr, "failed to open\n");
>         return 1;
>     }
>
>     if (fadv_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>     }
>     for (i = 0; i < 100; i++) {
>         write(fd, mb, len);
>     }
>     if (fadv_final_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>     }
>     if (do_fsync) {
>         fsync(fd);
>     }
>
>     close(fd);
> }
>
>
>
> [root@kvm fadvise]# gcc a.c
> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>   RES PAGES  SIZE FILE
>  100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>   RES PAGES  SIZE FILE
>  100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>   RES PAGES  SIZE FILE
>   36M  9216  100M x
> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>   RES PAGES  SIZE FILE
>  100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>   RES PAGES  SIZE FILE
>  100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>   RES PAGES  SIZE FILE
>   36M  9216  100M x
> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
> RES PAGES SIZE FILE
>  0B     0   0B x
>
>
>
> Backup-generated pagecache is effectively trash; it will never be used.
> And it's bad that it can displace other, useful pagecache. So the best
> thing for backup is direct mode + the proposed cache.
>
>
I think that Max's original intention was about POSIX_FADV_DONTNEED:
one should call this fadvise for each just-fully-written cluster. Though this
is a bit error-prone and, from a performance point of view, will be slower.

This call could be slow and thus it should be delegated to a coroutine. We
have thought about this, but the amount of work to implement it, even if it
seems lower, is not really lower, and the result would not be as good.
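
Roughly, what is meant is something like the sketch below. It is purely
illustrative: the cluster size, the fill counter and the function name are
made up and not taken from the series, and note that DONTNEED does not drop
pages that are still dirty.

/* Illustrative sketch only (not from the series): drop the pagecache for a
 * host cluster once all compressed chunks inside it have been written. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

#define CLUSTER_SIZE (64 * 1024)

typedef struct ClusterFill {
    uint64_t cluster_offset;   /* host offset of the cluster being filled */
    uint32_t bytes_filled;     /* compressed bytes accounted so far */
} ClusterFill;

/* Called after a compressed chunk of 'len' bytes has landed in the cluster. */
void chunk_written(int fd, ClusterFill *c, uint32_t len)
{
    c->bytes_filled += len;
    if (c->bytes_filled >= CLUSTER_SIZE) {
        /*
         * The cluster will not be read back, so ask the kernel to drop it.
         * Pages that are still dirty are not dropped, so this only really
         * helps once writeback has happened, and the call itself may block,
         * hence the idea of delegating it to a coroutine.
         */
        posix_fadvise(fd, c->cluster_offset, CLUSTER_SIZE,
                      POSIX_FADV_DONTNEED);
        c->cluster_offset += CLUSTER_SIZE;
        c->bytes_filled -= CLUSTER_SIZE;
    }
}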

Den


