
Re: [PATCH v9 2/3] cpu-throttle: implement vCPU throttle


From: Hyman
Subject: Re: [PATCH v9 2/3] cpu-throttle: implement vCPU throttle
Date: Wed, 8 Dec 2021 23:36:32 +0800
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.3.2



On 2021/12/6 18:10, Peter Xu wrote:
On Fri, Dec 03, 2021 at 09:39:46AM +0800, huangy81@chinatelecom.cn wrote:
+static uint64_t dirtylimit_pct(unsigned int last_pct,
+                               uint64_t quota,
+                               uint64_t current)
+{
+    uint64_t limit_pct = 0;
+    RestrainPolicy policy;
+    bool mitigate = (quota > current) ? true : false;
+
+    if (mitigate && ((current == 0) ||
+        (last_pct <= DIRTYLIMIT_THROTTLE_SLIGHT_STEP_SIZE))) {
+        return 0;
+    }
+
+    policy = dirtylimit_policy(last_pct, quota, current);
+    switch (policy) {
+    case RESTRAIN_SLIGHT:
+        /* [90, 99] */
+        if (mitigate) {
+            limit_pct =
+                last_pct - DIRTYLIMIT_THROTTLE_SLIGHT_STEP_SIZE;
+        } else {
+            limit_pct =
+                last_pct + DIRTYLIMIT_THROTTLE_SLIGHT_STEP_SIZE;
+
+            limit_pct = MIN(limit_pct, CPU_THROTTLE_PCT_MAX);
+        }
+        break;
+    case RESTRAIN_HEAVY:
+        /* [75, 90) */
+        if (mitigate) {
+            limit_pct =
+                last_pct - DIRTYLIMIT_THROTTLE_HEAVY_STEP_SIZE;
+        } else {
+            limit_pct =
+                last_pct + DIRTYLIMIT_THROTTLE_HEAVY_STEP_SIZE;
+
+            limit_pct = MIN(limit_pct,
+                DIRTYLIMIT_THROTTLE_SLIGHT_WATERMARK);
+        }
+        break;
+    case RESTRAIN_RATIO:
+        /* [0, 75) */
+        if (mitigate) {
+            if (last_pct <= (((quota - current) * 100 / quota))) {
+                limit_pct = 0;
+            } else {
+                limit_pct = last_pct -
+                    ((quota - current) * 100 / quota);
+                limit_pct = MAX(limit_pct, CPU_THROTTLE_PCT_MIN);
+            }
+        } else {
+            limit_pct = last_pct +
+                ((current - quota) * 100 / current);
+
+            limit_pct = MIN(limit_pct,
+                DIRTYLIMIT_THROTTLE_HEAVY_WATERMARK);
+        }
+        break;
+    case RESTRAIN_KEEP:
+    default:
+        limit_pct = last_pct;
+        break;
+    }
+
+    return limit_pct;
+}
+
+static void *dirtylimit_thread(void *opaque)
+{
+    int cpu_index = *(int *)opaque;
+    uint64_t quota_dirtyrate, current_dirtyrate;
+    unsigned int last_pct = 0;
+    unsigned int pct = 0;
+
+    rcu_register_thread();
+
+    quota_dirtyrate = dirtylimit_quota(cpu_index);
+    current_dirtyrate = dirtylimit_current(cpu_index);
+
+    pct = dirtylimit_init_pct(quota_dirtyrate, current_dirtyrate);
+
+    do {
+        trace_dirtylimit_impose(cpu_index,
+            quota_dirtyrate, current_dirtyrate, pct);
+
+        last_pct = pct;
+        if (pct == 0) {
+            sleep(DIRTYLIMIT_CALC_PERIOD_TIME_S);
+        } else {
+            dirtylimit_check(cpu_index, pct);
+        }
+
+        quota_dirtyrate = dirtylimit_quota(cpu_index);
+        current_dirtyrate = dirtylimit_current(cpu_index);
+
+        pct = dirtylimit_pct(last_pct, quota_dirtyrate, current_dirtyrate);

So what I had in mind is that we can start with an extremely simple version of
a negative feedback system.  Say, firstly each vcpu will have a simple number
controlling how long to sleep on each ring-full event (this is ugly code, but
it just shows what I meant..):

===============
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index eecd8031cf..c320fd190f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2932,6 +2932,8 @@ int kvm_cpu_exec(CPUState *cpu)
              trace_kvm_dirty_ring_full(cpu->cpu_index);
              qemu_mutex_lock_iothread();
              kvm_dirty_ring_reap(kvm_state);
+            if (dirtylimit_enabled(cpu->cpu_index) && cpu->throttle_us_per_full)
+                usleep(cpu->throttle_us_per_full);
              qemu_mutex_unlock_iothread();
              ret = 0;
              break;
===============

I think this will throttle at a finer granularity (for a 4096-entry ring of
4KiB pages, that's one sleep per 16MB of dirtied memory) than the current way,
where we inject a per-vcpu async task to sleep, like auto-converge does.

Then we have the "black box" to tune this value with below input/output:

   - Input: dirty rate information, same as current algo

   - Output: increase/decrease of per-vcpu throttle_us_per_full above, and
     that's all

We can do the sampling per-second, then we keep doing it: we can have 1 thread
doing per-second task collecting dirty rate information for all the vcpus, then
tune that throttle_us_per_full for each of them.

The simplest linear algorithm would be as simple as (for each vcpu):

   if (quota < current) {
       throttle_us_per_full += SOMETHING;
       if (throttle_us_per_full > MAX)
           throttle_us_per_full = MAX;
   } else {
       throttle_us_per_full -= SOMETHING;
       if (throttle_us_per_full < 0)
           throttle_us_per_full = 0;
   }

I think your algorithm is fine, but thoroughly reviewing every single bit of it
in one shot will be challenging, and it's also hard to prove that every bit of
the algorithm is helpful, as there are a lot of hand-made macros and state
changes.

I actually tested your current algorithm: the dirty rate fluctuates a bit (when
I specified 200MB/s, it can go to either a few tens of MB/s or 300MB/s,
normally less), and it doesn't respond fast either (the initial throttle from
500MB/s -> 200MB/s needed a minute or so), so it seems not ideal anyway.  In
that case I'd prefer we start simple.

So IMHO we can start with this simple scheme first; it'll start working with
far fewer lines of code, AFAICT.  With that scheme ready in the first patches,
it'll be easier to apply any better algorithm afterwards (e.g. your current
one, if you're confident in it), and it'll also be much easier to review if you
could split your patches like that.

Per my knowledge of what migration needs, we could consider adding an integral
term to the linear algorithm above, and that should already get us to a very
stable, constant state of throttling.  But we'll need to try it out, as I've
never tried.
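
For reference, an integral term on top of the proportional step above would
make it a simple PI controller, run once per vcpu per sampling period from the
same per-second thread.  A rough sketch below; the gains (100 and 10) and all
the names here are pure guesses for illustration and would need to be found by
experiment:

===============
typedef struct {
    int64_t integral;   /* accumulated error across sampling periods */
} DirtyLimitPI;

static void dirtylimit_pi_tune(CPUState *cpu, DirtyLimitPI *pi,
                               uint64_t quota, uint64_t current)
{
    /* error > 0 means the vcpu dirties memory faster than allowed */
    int64_t error = (int64_t)current - (int64_t)quota;
    int64_t output;

    /* The integral term removes the steady-state offset that a pure
     * proportional controller would leave behind */
    pi->integral += error;

    /* Proportional + integral, with made-up gains */
    output = error * 100 + pi->integral * 10;

    /* Clamp to a sane range; also keeps the sleep time bounded */
    output = MAX(output, 0);
    output = MIN(output, THROTTLE_MAX_US);

    cpu->throttle_us_per_full = output;
}
===============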

What do you think?

I absolutely agree with your point.  A negative feedback system is also what I
thought of in the first place, and from my point of view it is theoretically
the most appropriate algorithm to keep a vCPU at a stable dirty page rate.  But
at the very beginning I wasn't sure a new throttle algorithm would be accepted,
so I adopted the existing auto-converge algorithm in QEMU... :)  One of my
purposes in posting this patchset was to request comments, so thanks Peter very
much for the advice.

I'll try it out and see the results.  If things go well, the negative feedback
system controlling the dirty page rate per vCPU will be introduced in the next
version.


