From: Emilio G. Cota
Subject: Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
Date: Wed, 21 Sep 2016 13:24:44 -0400
User-agent: Mutt/1.5.23 (2014-03-12)

On Mon, Sep 19, 2016 at 14:50:59 +0200, Paolo Bonzini wrote:
> Set cpu->running without taking the cpu_list lock, only look at it if
> there is a concurrent exclusive section.  This requires adding a new
> field to CPUState, which records whether a running CPU is being counted
> in pending_cpus.  When an exclusive section is started concurrently with
> cpu_exec_start, cpu_exec_start can use the new field to wait for the end
> of the exclusive section.
> 
> This is a separate patch for easier bisection of issues.
> 
> Signed-off-by: Paolo Bonzini <address@hidden>
> ---
>  cpus-common.c              | 73 ++++++++++++++++++++++++++++++++++++++++------
>  docs/tcg-exclusive.promela | 53 +++++++++++++++++++++++++++++++--
>  include/qom/cpu.h          |  5 ++--
>  3 files changed, 117 insertions(+), 14 deletions(-)
> 
> diff --git a/cpus-common.c b/cpus-common.c
> index f7ad534..46cf8ef 100644
> --- a/cpus-common.c
> +++ b/cpus-common.c
> @@ -184,8 +184,12 @@ void start_exclusive(void)
>  
>      /* Make all other cpus stop executing.  */
>      pending_cpus = 1;
> +
> +    /* Write pending_cpus before reading other_cpu->running.  */
> +    smp_mb();
>      CPU_FOREACH(other_cpu) {
>          if (other_cpu->running) {
> +            other_cpu->has_waiter = true;
>              pending_cpus++;
>              qemu_cpu_kick(other_cpu);
>          }
> @@ -212,24 +216,75 @@ void end_exclusive(void)
>  /* Wait for exclusive ops to finish, and begin cpu execution.  */
>  void cpu_exec_start(CPUState *cpu)
>  {
> -    qemu_mutex_lock(&qemu_cpu_list_mutex);
> -    exclusive_idle();
>      cpu->running = true;
> -    qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +
> +    /* Write cpu->running before reading pending_cpus.  */
> +    smp_mb();
> +
> +    /* 1. start_exclusive saw cpu->running == true and pending_cpus >= 1.
> +     * After taking the lock we'll see cpu->has_waiter == true and run---not
> +     * for long because start_exclusive kicked us.  cpu_exec_end will
> +     * decrement pending_cpus and signal the waiter.
> +     *
> +     * 2. start_exclusive saw cpu->running == false but pending_cpus >= 1.
> +     * This includes the case when an exclusive item is running now.
> +     * Then we'll see cpu->has_waiter == false and wait for the item to
> +     * complete.
> +     *
> +     * 3. pending_cpus == 0.  Then start_exclusive is definitely going to
> +     * see cpu->running == true, and it will kick the CPU.
> +     */
> +    if (pending_cpus) {
> +        qemu_mutex_lock(&qemu_cpu_list_mutex);
> +        if (!cpu->has_waiter) {
> +            /* Not counted in pending_cpus, let the exclusive item
> +             * run.  Since we have the lock, set cpu->running to true
> +             * while holding it instead of retrying.
> +             */
> +            cpu->running = false;
> +            exclusive_idle();
> +            /* Now pending_cpus is zero.  */
> +            cpu->running = true;
> +        } else {
> +            /* Counted in pending_cpus, go ahead.  */
> +        }
> +        qemu_mutex_unlock(&qemu_cpu_list_mutex);
> +    }

wrt scenario (3): I don't think other threads will always see cpu->running == true.
Consider the following:

cpu0                                    cpu1
----                                    ----

cpu->running = true;                    pending_cpus = 1;
smp_mb();                               smp_mb();
if (pending_cpus) { /* false */ }       CPU_FOREACH(other_cpu) {
                                            if (other_cpu->running) {
                                                /* false */
                                            }
                                        }

The barriers here don't guarantee that changes are immediately visible to
other threads (for that we'd need strong operations, i.e. atomics).
So in the example above, pending_cpus has been set to 1, but that store might
not yet be visible to cpu0. The same applies to cpu0->running: despite the
barrier, cpu1 might not yet perceive it, and could therefore miss kicking
cpu0 (and proceed while cpu0 executes).
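
For concreteness, here is a store-buffering litmus test that probes exactly
this interleaving. This is a minimal sketch, not QEMU code: the pthread
harness and the names (running, pending, cpu0_fn, cpu1_fn) are made up,
with atomic_thread_fence(memory_order_seq_cst) standing in for smp_mb().

/* Store-buffering litmus test for the interleaving above: cpu0 stores
 * "running" and reads "pending"; cpu1 stores "pending" and reads
 * "running".  The "missed kick" outcome is both sides reading 0.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int running, pending;
static int r_cpu0, r_cpu1;

static void *cpu0_fn(void *arg)
{
    atomic_store_explicit(&running, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
    r_cpu0 = atomic_load_explicit(&pending, memory_order_relaxed);
    return NULL;
}

static void *cpu1_fn(void *arg)
{
    atomic_store_explicit(&pending, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
    r_cpu1 = atomic_load_explicit(&running, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    int missed = 0;

    for (int i = 0; i < 100000; i++) {
        pthread_t t0, t1;

        atomic_store(&running, 0);
        atomic_store(&pending, 0);
        pthread_create(&t0, NULL, cpu0_fn, NULL);
        pthread_create(&t1, NULL, cpu1_fn, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r_cpu0 == 0 && r_cpu1 == 0) {
            missed++;       /* neither side saw the other's write */
        }
    }
    printf("missed kicks: %d\n", missed);
    return 0;
}

If the outcome I'm worried about is allowed by the memory model that
smp_mb() is supposed to provide, the counter should eventually go above
zero on a weakly ordered host.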

Is there a performance (scalability) reason behind this patch? The only case
I can think of is a guest that executes atomics very frequently, which would
otherwise be very slow. However, once the cmpxchg patchset goes in, those
atomics will be emulated without leaving the CPU loop.

If we want this to scale better without complicating things too much,
I'd focus on converting the exclusive_resume broadcast into a signal,
so that we avoid the thundering herd problem. It's not clear to me what
workloads would contend on start/end_exclusive, though.
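
To illustrate the broadcast-to-signal idea, here is a generic sketch in
plain pthreads (not QEMU's qemu_cond_* wrappers; all names are made up):
the thread ending the exclusive section wakes a single waiter, and each
woken waiter chains the wakeup along, so the waiters trickle out one by
one instead of stampeding onto the lock all at once.

/* Chained-signal wakeup: an alternative to pthread_cond_broadcast(). */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_WAITERS 8

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t resume = PTHREAD_COND_INITIALIZER;
static bool exclusive_running = true;

/* Stand-in for a vCPU thread blocked in exclusive_idle(). */
static void *waiter(void *arg)
{
    pthread_mutex_lock(&lock);
    while (exclusive_running) {
        pthread_cond_wait(&resume, &lock);
    }
    /* Pass the wakeup to at most one more waiter. */
    pthread_cond_signal(&resume);
    pthread_mutex_unlock(&lock);
    printf("waiter %ld resumed\n", (long)(intptr_t)arg);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_WAITERS];

    for (long i = 0; i < NUM_WAITERS; i++) {
        pthread_create(&t[i], NULL, waiter, (void *)(intptr_t)i);
    }

    /* The exclusive section ends: wake one waiter, not all of them. */
    pthread_mutex_lock(&lock);
    exclusive_running = false;
    pthread_cond_signal(&resume);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_WAITERS; i++) {
        pthread_join(t[i], NULL);
    }
    return 0;
}

Whether this actually helps depends on how many threads pile up on
start/end_exclusive in practice, which as I said above isn't clear to me.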

Thanks,

                E.


