Re: [PATCH v6 0/4] migration: UFFD write-tracking migration/snapshots


From: Andrey Gruzdev
Subject: Re: [PATCH v6 0/4] migration: UFFD write-tracking migration/snapshots
Date: Tue, 15 Dec 2020 22:52:45 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

On 11.12.2020 18:09, Peter Xu wrote:
On Fri, Dec 11, 2020 at 04:13:02PM +0300, Andrey Gruzdev wrote:
I've also made wr-fault resolution latency measurements for the case when the
migration stream is dumped to a file in cached mode. It should approximately
match saving to the file fd directly, though I used 'migrate exec:<>' with a
hand-written tool.
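
For reference, the exec transport just pipes the migration stream into an
arbitrary command, so with a plain shell command in place of the hand-written
tool it would look roughly like this from the HMP monitor (the path here is
illustrative):

        (qemu) migrate "exec:cat > /path/to/snapshot.bin"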

VM config is 6 vCPUs + 16GB RAM, qcow2 image on a Seagate 7200.11 series 1.5TB
HDD; the snapshot goes to the same disk. Guest is Windows 10.

The test scenario is playing a full-HD YouTube video in Firefox while saving
the snapshot.

Latency measurement begin/end points are fs/userfaultfd.c:handle_userfault() and
mm/userfaultfd.c:mwriteprotect_range(), respectively. For any faulting page, the
oldest wr-fault timestamp is accounted.

The whole time to take the snapshot was ~30 secs; the file size is around 3GB.
So far it doesn't seem to be a bad picture. However, the 16-255 msec range
worries me a bit; it seems to cause audio backend buffer underflows sometimes.


      msecs               : count     distribution
          0 -> 1          : 111755   |****************************************|
          2 -> 3          : 52       |                                        |
          4 -> 7          : 105      |                                        |
          8 -> 15         : 428      |                                        |
         16 -> 31         : 335      |                                        |
         32 -> 63         : 4        |                                        |
         64 -> 127        : 8        |                                        |
        128 -> 255        : 5        |                                        |

Great test!  Thanks for sharing this information.

Yes, it's good enough for a 1st version; it's already more than just
functionally working. :)

So did you try your previous patch to see whether it could improve things in
some way?  Again, we can gradually optimize upon your current work.

Btw, you reminded me: why don't we track all of this from the kernel? :) That's
a good idea.  So, how did you trace it yourself?  Something like the below
should work with bpftrace, but I feel like you did it in some other way, so
just fyi:

         # cat latency.bpf
         // Stamp the moment a thread blocks on a write-protect fault.
         kprobe:handle_userfault
         {
                 @start[tid] = nsecs;
         }

         // On return, account how long the thread waited, in microseconds.
         kretprobe:handle_userfault
         {
                 if (@start[tid]) {
                         $delay = nsecs - @start[tid];
                         delete(@start[tid]);
                         @delay_us = hist($delay / 1000);
                 }
         }
         # bpftrace latency.bpf

Tracing the return of handle_userfault() could be more accurate, in that it
also includes the latency from UFFDIO_WRITEPROTECT until the vCPU gets woken
up again.  However, it can be inaccurate, because after a recent change to
this code path in commit f9bf352224d7 ("userfaultfd: simplify fault handling",
2020-08-03), handle_userfault() can return even before the page fault is
resolved.  It should still be good enough in most cases, because even if that
happens, the thread just faults into handle_userfault() again, and we simply
get one more count.

Thanks!

Peter, thanks for the idea. Now I've also tried it with a kretprobe, for
Windows 10 and Ubuntu 20.04 guests, two runs each. Windows is ugly here :(

First is a series of runs without scan-rate-limiting.patch.
Windows 10:

     msecs               : count     distribution
         0 -> 1          : 131913   |****************************************|
         2 -> 3          : 106      |                                        |
         4 -> 7          : 362      |                                        |
         8 -> 15         : 619      |                                        |
        16 -> 31         : 28       |                                        |
        32 -> 63         : 1        |                                        |
        64 -> 127        : 2        |                                        |


     msecs               : count     distribution
         0 -> 1          : 199273   |****************************************|
         2 -> 3          : 190      |                                        |
         4 -> 7          : 425      |                                        |
         8 -> 15         : 927      |                                        |
        16 -> 31         : 69       |                                        |
        32 -> 63         : 3        |                                        |
        64 -> 127        : 16       |                                        |
       128 -> 255        : 2        |                                        |

Ubuntu 20.04:

     msecs               : count     distribution
         0 -> 1          : 104954   |****************************************|
         2 -> 3          : 9        |                                        |

     msecs               : count     distribution
         0 -> 1          : 147159   |****************************************|
         2 -> 3          : 13       |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |


Here are runs with scan-rate-limiting.patch.
Windows 10:

     msecs               : count     distribution
         0 -> 1          : 234492   |****************************************|
         2 -> 3          : 66       |                                        |
         4 -> 7          : 219      |                                        |
         8 -> 15         : 109      |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |

     msecs               : count     distribution
         0 -> 1          : 183171   |****************************************|
         2 -> 3          : 109      |                                        |
         4 -> 7          : 281      |                                        |
         8 -> 15         : 444      |                                        |
        16 -> 31         : 3        |                                        |
        32 -> 63         : 1        |                                        |

Ubuntu 20.04:

     msecs               : count     distribution
         0 -> 1          : 92224    |****************************************|
         2 -> 3          : 9        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 1        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |

     msecs               : count     distribution
         0 -> 1          : 97021    |****************************************|
         2 -> 3          : 7        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 1        |                                        |

So, the initial variant of rate limiting has some positive effect, but not a
very noticeable one. The interesting case is the Windows guest: why is the
difference so large compared to Linux? Theoretically, the reason might be some
of the virtio or QXL drivers; hard to say. At least the Windows VM has been
configured with a set of Hyper-V enlightenments, so there's nothing left to
improve in the domain config.

For Linux guests, latencies are good enough without any additional effort.

Also, I've missed some code to deal with snapshotting a suspended guest, so
I'll make a v7 series with that fix and also try to add a more effective
solution to reduce the millisecond-grade latencies.

And yes, I've used a bpftrace-like tool: BCC from iovisor, with the Python
frontend. It seems a bit more friendly than bpftrace.
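
For completeness, here is a minimal sketch of that kind of BCC tool, matching
the begin/end points mentioned above (entry of handle_userfault() and entry of
mwriteprotect_range()), keyed by faulting page address so that the oldest
timestamp per page wins. The map names, the 4K page-size mask, the assumption
that the resolving UFFDIO_WRITEPROTECT covers one page at a time, and the
kernel function signatures (circa Linux 5.9) are illustrative, not the exact
tool I used:

        #!/usr/bin/env python3
        # Sketch only: measure wr-fault resolution latency per page with BCC.
        import time
        from bcc import BPF

        prog = r"""
        #include <uapi/linux/ptrace.h>
        #include <linux/mm.h>

        BPF_HASH(start, u64, u64);   // faulting page address -> oldest timestamp
        BPF_HISTOGRAM(dist);         // log2 histogram of latencies, in msecs

        // Entry of fs/userfaultfd.c:handle_userfault(): stamp the first
        // (oldest) wr-fault on this page; repeated faults on the same
        // page keep the original timestamp.
        int trace_fault(struct pt_regs *ctx, struct vm_fault *vmf)
        {
            u64 addr = vmf->address & ~0xFFFULL;   /* assumes 4K pages */
            if (!start.lookup(&addr)) {
                u64 now = bpf_ktime_get_ns();
                start.update(&addr, &now);
            }
            return 0;
        }

        // Entry of mm/userfaultfd.c:mwriteprotect_range(): the un-protect
        // path resolves the fault; assumes one faulting page per ioctl.
        int trace_resolve(struct pt_regs *ctx, struct mm_struct *mm,
                          u64 addr, u64 len, int enable_wp)
        {
            if (enable_wp)               /* ignore the protect direction */
                return 0;
            u64 *tsp = start.lookup(&addr);
            if (tsp) {
                u64 delta_ms = (bpf_ktime_get_ns() - *tsp) / 1000000;
                dist.increment(bpf_log2l(delta_ms));
                start.delete(&addr);
            }
            return 0;
        }
        """

        b = BPF(text=prog)
        b.attach_kprobe(event="handle_userfault", fn_name="trace_fault")
        b.attach_kprobe(event="mwriteprotect_range", fn_name="trace_resolve")

        print("Tracing wr-fault latency... Ctrl-C to print the histogram.")
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            b["dist"].print_log2_hist("msecs")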

--
Andrey Gruzdev, Principal Engineer
Virtuozzo GmbH  +7-903-247-6397
                virtuzzo.com



