
Re: [PATCH] migration/calc-dirty-rate: millisecond precision period


From: gudkov.andrei
Subject: Re: [PATCH] migration/calc-dirty-rate: millisecond precision period
Date: Tue, 1 Aug 2023 17:55:29 +0300

On Mon, Jul 31, 2023 at 04:06:24PM -0400, Peter Xu wrote:
> Hi, Andrei,
> 
> On Mon, Jul 31, 2023 at 05:51:49PM +0300, gudkov.andrei@huawei.com wrote:
> > On Mon, Jul 17, 2023 at 03:08:37PM -0400, Peter Xu wrote:
> > > On Tue, Jul 11, 2023 at 03:38:18PM +0300, gudkov.andrei@huawei.com wrote:
> > > > On Thu, Jul 06, 2023 at 03:23:43PM -0400, Peter Xu wrote:
> > > > > On Thu, Jun 29, 2023 at 11:59:03AM +0300, Andrei Gudkov wrote:
> > > > > > Introduces an alternative argument calc-time-ms, which is
> > > > > > the same as calc-time but accepts a millisecond value.
> > > > > > Millisecond precision makes it possible to predict whether
> > > > > > migration will succeed or not. To do this, calculate dirty
> > > > > > rate with calc-time-ms set to the max allowed downtime, convert
> > > > > > the measured rate into a volume of dirtied memory, and divide
> > > > > > it by the network throughput. If the result is lower than the
> > > > > > max allowed downtime, then migration will converge.
> > > > > > 
> > > > > > Measurement results for single thread randomly writing to
> > > > > > a 24GiB region:
> > > > > > +--------------+--------------------+
> > > > > > | calc-time-ms | dirty-rate (MiB/s) |
> > > > > > +--------------+--------------------+
> > > > > > |          100 |               1880 |
> > > > > > |          200 |               1340 |
> > > > > > |          300 |               1120 |
> > > > > > |          400 |               1030 |
> > > > > > |          500 |                868 |
> > > > > > |          750 |                720 |
> > > > > > |         1000 |                636 |
> > > > > > |         1500 |                498 |
> > > > > > |         2000 |                423 |
> > > > > > +--------------+--------------------+
> > > > > 
> > > > > Do you mean the dirty workload is constant?  Why does it differ
> > > > > so much with different calc-time-ms?
> > > > 
> > > > The workload is as constant as it could be. But the naming is
> > > > misleading: what is named "dirty-rate" is in fact not a "rate" at all.
> > > > calc-dirty-rate measures the number of *uniquely* dirtied pages, i.e.
> > > > each page can contribute to the counter only once during the
> > > > measurement period. That's why the values are decreasing. Consider
> > > > also the ad infinitum argument: since a VM has a fixed number of pages
> > > > and each page can be dirtied only once,
> > > > dirty-rate = number-of-dirtied-pages/calc-time -> 0 as calc-time -> inf.
> > > > It would make more sense to report the number as "dirty-volume" --
> > > > without dividing it by calc-time.
> > > > 
> > > > Note that the number of *uniquely* dirtied pages in a given amount of
> > > > time is exactly what we need for migration-related predictions. There
> > > > is no error here.
> > > 
> > > Is calc-time-ms the duration of the measurement?
> > > 
> > > Taking the 1st line as an example, 1880MB/s * 0.1s = 188MB.
> > > For the 2nd line, 1340MB/s * 0.2s = 268MB.
> > > Even for the longest duration of 2s, that's 846MB in total.
> > > 
> > > The range is 24GB.  In this case, most of the pages should only be
> > > written once, even with random writes, for all these test durations, right?
> > > 
> > 
> > Yes, I messed up the load generator.
> > The effective memory region was much smaller than 24GiB.
> > I performed more testing (after fixing the load generator),
> > now with different memory sizes and different modes.
> > 
> > +--------------+-----------------------------------------------+
> > | calc-time-ms |                dirty rate MiB/s               |
> > |              +----------------+---------------+--------------+
> > |              | theoretical    | page-sampling | dirty-bitmap |
> > |              | (at 3M wr/sec) |               |              |
> > +--------------+----------------+---------------+--------------+
> > |                             1GiB                             |
> > +--------------+----------------+---------------+--------------+
> > |          100 |           6996 |          7100 |         3192 |
> > |          200 |           4606 |          4660 |         2655 |
> > |          300 |           3305 |          3280 |         2371 |
> > |          400 |           2534 |          2525 |         2154 |
> > |          500 |           2041 |          2044 |         1871 |
> > |          750 |           1365 |          1341 |         1358 |
> > |         1000 |           1024 |          1052 |         1025 |
> > |         1500 |            683 |           678 |          684 |
> > |         2000 |            512 |           507 |          513 |
> > +--------------+----------------+---------------+--------------+
> > |                             4GiB                             |
> > +--------------+----------------+---------------+--------------+
> > |          100 |          10232 |          8880 |         4070 |
> > |          200 |           8954 |          8049 |         3195 |
> > |          300 |           7889 |          7193 |         2881 |
> > |          400 |           6996 |          6530 |         2700 |
> > |          500 |           6245 |          5772 |         2312 |
> > |          750 |           4829 |          4586 |         2465 |
> > |         1000 |           3865 |          3780 |         2178 |
> > |         1500 |           2694 |          2633 |         2004 |
> > |         2000 |           2041 |          2031 |         1789 |
> > +--------------+----------------+---------------+--------------+
> > |                             24GiB                            |
> > +--------------+----------------+---------------+--------------+
> > |          100 |          11495 |          8640 |         5597 |
> > |          200 |          11226 |          8616 |         3527 |
> > |          300 |          10965 |          8386 |         2355 |
> > |          400 |          10713 |          8370 |         2179 |
> > |          500 |          10469 |          8196 |         2098 |
> > |          750 |           9890 |          7885 |         2556 |
> > |         1000 |           9354 |          7506 |         2084 |
> > |         1500 |           8397 |          6944 |         2075 |
> > |         2000 |           7574 |          6402 |         2062 |
> > +--------------+----------------+---------------+--------------+
> > 
> > Theoretical values are computed according to the following formula:
> > size * (1 - (1-(4096/size))^(time*wps)) / (time * 2^20),
> 
> Thanks for more testings and the statistics.
> 
> I had a feeling that this formula may or may not be accurate, but that's
> less of an issue here.
> 
> > where size is in bytes, time is in seconds, and wps is number of
> > writes per second (I measured approximately 3000000 on my system).
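
As a side note, the theoretical column is easy to reproduce with a few
lines of Python (3M wr/sec is just the value I measured on my host):

def theoretical_dirty_rate_mib(size_bytes, time_sec, wps=3_000_000):
    # expected number of unique bytes touched by time_sec*wps random
    # page-sized writes into a region of size_bytes, expressed in MiB/s
    dirtied = size_bytes * (1.0 - (1.0 - 4096.0 / size_bytes) ** (time_sec * wps))
    return dirtied / (time_sec * 2**20)

# example: theoretical_dirty_rate_mib(1 << 30, 0.5) gives ~2041,
# matching the 1GiB / 500ms cell above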
> > 
> > Theoretical values and values obtained with page-sampling are
> > reasonably close (within ~25%). Dirty-bitmap values are much lower,
> > likely because the majority of writes cause page faults. Even though
> > dirty-bitmap logic is closer to what happens during live migration,
> > I still favor page sampling because the latter doesn't impact the
> > performance of the VM as much.
> 
> Do you really use page sampling in production?  I don't remember whether I
> mentioned it anywhere before, but it will provide a very wrong number when
> the memory updates have locality, afaik.  For example, when a 4G VM only has
> 1G actively updated, the result can be 25% of reality iiuc, seeing that the
> other 3G didn't even change.  It only works well with very distributed
> memory updates.
> 

Hmmm, such underestimation looks strange to me. I am willing to test
page-sampling and see whether its quality can be improved. Do you have
any specific suggestions on which application to use as a workload?

If it turns out that page-sampling is not an option, then the performance
impact of the dirty-bitmap must be improved somehow. Maybe it makes
sense to split memory into 4GiB chunks and measure the dirty page rate
independently for each chunk (without enabling page protection for
memory outside of the currently processed chunk). But the downsides are
that 1) total measurement time will increase proportionally to the number
of chunks, and 2) the dirty page rate will be overestimated.
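
To make the idea concrete, here is a rough sketch of the loop I have in
mind; the two helpers are hypothetical placeholders for the real
dirty-log machinery, not existing QEMU functions:

from time import sleep

CHUNK = 4 << 30   # measure 4 GiB of guest RAM at a time

def start_dirty_logging(offset, length):
    pass          # placeholder: enable dirty logging for this chunk only

def stop_and_count_dirty_pages(offset, length):
    return 0      # placeholder: sync the bitmap and count dirty pages

def measure_chunked(ram_size, calc_time_sec):
    total_dirty_pages = 0
    for offset in range(0, ram_size, CHUNK):
        length = min(CHUNK, ram_size - offset)
        start_dirty_logging(offset, length)     # only this chunk pays the fault cost
        sleep(calc_time_sec)
        total_dirty_pages += stop_and_count_dirty_pages(offset, length)
    # downside 1: wall-clock time is (ram_size / CHUNK) * calc_time_sec
    # downside 2: chunks are observed in different windows, so the summed
    #             result tends to overestimate the dirty page rate
    return total_dirty_pages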

But actually I still have hopes for page sampling. Since my goal is to
roughly predict what can be migrated and what cannot, I would prefer
to keep the predictor as lightweight as possible, even at the cost of
(overestimation) error.
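
To spell out what "predict" means here, the check described in the patch
summary is nothing more than the following (all numbers are made up for
illustration):

downtime_limit_ms = 300      # calc-time-ms is set to the max allowed downtime
dirty_rate_mib_s = 2500      # reported by calc-dirty-rate for that window
net_mib_s = 10000            # available migration bandwidth

dirtied_mib = dirty_rate_mib_s * downtime_limit_ms / 1000.0   # volume dirtied in the window
transfer_ms = dirtied_mib / net_mib_s * 1000.0                # time needed to send it
# 2500 * 0.3 = 750 MiB; 750 / 10000 = 0.075 s = 75 ms < 300 ms
print("converges" if transfer_ms < downtime_limit_ms else "does not converge")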

> > 
> > Whether calc-time < 1sec is meaningful or not depends on the size
> > of the memory region with active writes.
> > 1. If we have a big VM and writes are evenly spread over the whole
> >    address space, then almost all writes will go into unique pages.
> >    In this case the number of dirty pages will grow approximately
> >    linearly with time for small calc-time values.
> > 2. But if the memory region with active writes is small enough, then
> >    many writes will go to the same page, and the number of dirty pages
> >    will grow sublinearly even for small calc-time values. Note that
> >    the second scenario can happen even if VM RAM is big. For example,
> >    imagine a 128GiB VM with an in-memory database that is used for
> >    reading. Although the VM size is big, the memory region with active
> >    writes is just the application stack.
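
Plugging small calc-time values into the formula above shows both
regimes; 256 MiB here is just an arbitrary small hot region picked for
illustration:

import math

def unique_pages(pages, t, wps=3_000_000):
    # same formula as before, with (1 - 1/pages)^(t*wps) approximated by exp()
    return pages * (1.0 - math.exp(-wps * t / pages))

print(unique_pages(6291456, 0.1))   # writes spread over 24 GiB: ~293k unique
                                    # pages out of 300k writes -- nearly linear
print(unique_pages(65536, 0.1))     # writes confined to 256 MiB: ~65k unique
                                    # pages out of the same 300k writes -- saturated
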
> 
> No issue here with supporting small calc-time.  I think as long as it's
> worthwhile in some use case I'd be fine with it (rather than requiring it
> to work for all use cases).  Not a super high bar to maintain the change.
> 
> I copied Yong too; he just volunteered to look after the dirtyrate stuff.
> 
> Thanks,
> 
> -- 
> Peter Xu


