Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default

From: Longpeng (Mike)
Subject: Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Date: Thu, 30 Nov 2017 17:26:44 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20120327 Thunderbird/11.0.1

On 2017/11/29 21:35, Roman Kagan wrote:

> On Wed, Nov 29, 2017 at 07:58:19PM +0800, Longpeng (Mike) wrote:
>> On 2017/11/29 18:41, Eduardo Habkost wrote:
>>> On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
>>>> On 2017/11/29 5:13, Eduardo Habkost wrote:
>>>>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
>>>>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
>>>>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
>>>>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
>>>>>>>> introduced the l3-cache property and enabled it by default, exposing an
>>>>>>>> L3 cache to the guest.
>>>>>>>> The motivation behind it was that in the Linux scheduler, when waking up
>>>>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
>>>>>>>> directly, without sending a reschedule IPI.  The reduction in the IPI
>>>>>>>> count led to a performance gain.
>>>>>>>> However, this isn't the whole story.  Once the task is on the target
>>>>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
>>>>>>>> it the idle task putting the CPU to sleep or just another running task.
>>>>>>>> For that a reschedule IPI will have to be issued, too.  Only when that
>>>>>>>> other CPU has been running a normal task for too little time will the
>>>>>>>> fairness constraints prevent the preemption and thus the IPI.
>>>> Agree. :)
>>>> Our test VM at the time was a SUSE 11 guest with idle=poll, and I now
>>>> realize that SUSE 11 has a bug in its scheduler.
>>>> On RHEL 7.3 or an upstream kernel, ttwu_queue_remote() issues a RES IPI
>>>> only if rq->idle is not polling:
>>>> '''
>>>> static void ttwu_queue_remote(struct task_struct *p, int cpu)
>>>> {
>>>>     struct rq *rq = cpu_rq(cpu);
>>>>
>>>>     if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>>>>         /* If the remote cpu's idle task is polling, just set
>>>>          * TIF_NEED_RESCHED and skip the IPI; otherwise send one. */
>>>>         if (!set_nr_if_polling(rq->idle))
>>>>             smp_send_reschedule(cpu);
>>>>         else
>>>>             trace_sched_wake_idle_without_ipi(cpu);
>>>>     }
>>>> }
>>>> '''
>>>> But the SUSE 11 kernel does not perform this check; it sends a RES IPI
>>>> unconditionally.
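
For comparison, a minimal sketch of the older shape of this function, before
set_nr_if_polling() was added in mainline v3.16 (the 3.0-based SLES 11 kernel
behaves like this; I'm sketching from the mainline history here, not the exact
SLES sources):
'''
static void ttwu_queue_remote(struct task_struct *p, int cpu)
{
    /* No polling check: every remotely queued wakeup raises an IPI. */
    if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
        smp_send_reschedule(cpu);
}
'''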
>>> So, does that mean no Linux guest benefits from the l3-cache=on
>>> default except SuSE 11 guests?
>> Not only that, there is another scenario:
>> static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
>> {
>>     /* remote queueing is used only when the two cpus do NOT share
>>      * an L3 (last-level) cache */
>>     if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>>         ...
>>         ttwu_queue_remote(p, cpu, wake_flags);
>>         return;
>>     }
>>     ...
>>     ttwu_do_activate(rq, p, wake_flags, &rf); <-- *Here*
>>     ...
>> }
>> In ttwu_do_activate() there are also some (low-probability) opportunities
>> to avoid sending a RES IPI even when the target cpu isn't idle-polling.
> Well, it isn't so low actually: what you need is to keep the cpus busy
> switching tasks.  In that case it's not uncommon that the task being
> woken up on a remote cpu has accumulated more vruntime than the task
> already running on that cpu; then the new task won't preempt the
> current task and the IPI won't be issued.  E.g. on a RHEL 7.4 guest we
> saw:

I get it, thanks.
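
For reference, the fairness check that decides whether a wakeup preempts (and
thus whether a RES IPI is sent) is wakeup_preempt_entity() in
kernel/sched/fair.c; this is its approximate mainline shape, quoted from
memory rather than a specific kernel version:
'''
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
    s64 gran, vdiff = curr->vruntime - se->vruntime;

    /* The woken task is behind the current one: no preemption. */
    if (vdiff <= 0)
        return -1;

    /* Preempt only if the woken task's vruntime lead exceeds the
     * wakeup granularity. */
    gran = wakeup_gran(se);
    if (vdiff > gran)
        return 1;

    return 0;
}
'''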

>>>>>>>> This boils down to the improvement being achievable only in workloads
>>>>>>>> with many actively switching tasks.  We had no access to the
>>>>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
>>>>>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
>>>>>>>> on a 1-socket, 8-core vCPU topology, where we indeed see:
>>>>>>>> l3-cache       #res IPI /s     #time / 10000 loops
>>>>>>>> off            560K            1.8 sec
>>>>>>>> on             40K             0.9 sec
> The workload where it bites is mostly idle guest, with chains of
> dependent wakeups, i.e. with little parallelism:
>>>>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>>>>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>>>>>>>> interactions and therefore excessive halts and IPIs.  E.g. "perf bench
>>>>>>>> sched pipe -i 100000" gives
>>>>>>>> l3-cache       #res IPI /s     #HLT /s         #time /100000 loops
>>>>>>>> off            200 (no K)      230             0.2 sec
>>>>>>>> on             400K            330K            0.5 sec
>>>> I guess this issue could be resolved by disabling SD_WAKE_AFFINE.
> Actually, it's SD_WAKE_AFFINE that's being effectively defeated by this
> l3-cache, because the scheduler thinks that the cpus that share the
> last-level cache are close enough that a dependent task can be woken up
> on a sibling cpu.
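
For reference, cpus_share_cache() in kernel/sched/core.c is a one-liner that
compares the cpus' last-level-cache domain ids, so exposing an L3 cache to the
guest makes all vcpus in a socket "share cache" in the scheduler's eyes:
'''
bool cpus_share_cache(int this_cpu, int that_cpu)
{
    return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}
'''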

In this case (sched pipe), without L3-cache, the dependent task is mostly
woken up on the original cpu; if the two tasks run on the same cpu, then the
dependent task is woken up without a RES IPI.  The related code is:
'''
void resched_curr(struct rq *rq)
    ...
    if (cpu == smp_processor_id()) {    /* local wakeup: no RES IPI */
        set_tsk_need_resched(curr);
'''

Do I understand correctly?  If not, I hope you can point out what's wrong :)

>>>> As Gonglei said:
>>>> 1. the L3 cache relates to the user experience.
>>> This is true, in a way: I have seen a fair share of user reports
>>> where they incorrectly blame the L3 cache absence or the L3 cache
>>> size for performance problems.
>>>> 2. glibc gets the cache info directly via CPUID, which relates to
>>>> memory performance.
>>> I'm interested in numbers that demonstrate that.
> Me too.  I vaguely remember debugging a memcpy degradation in the guest
> (on the Parallels proprietary hypervisor) that turned out to be due to a
> combination of l3 cache size and the cpu topology exposed to the guest,
> which caused glibc to choose an inadequate buffer size.

We faced the same problem several months ago.
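
If anyone wants to check what cache sizes the guest's glibc actually sees, a
minimal probe using the _SC_LEVEL*_CACHE_SIZE sysconf extensions (these values
are ultimately derived from CPUID, so inside a VM they reflect whatever cache
topology QEMU exposes):
'''
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Values are in bytes; glibc typically reports 0 if the level
     * is not present in the CPUID-reported topology. */
    printf("L2: %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3: %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
'''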

I did some simple tests at noon; it seems the numbers are better without
L3-cache except for 'perf bench sched messaging'.

VM: 1 socket, 8 cores, 3.10.0 guest
Hardware: Intel(R) Xeon(R) CPU E7-8890 v2 @ 2.80GHz

Stream (100 runs), MB/s:
l3    Copy    Scale    Add    Triad
off  8025.8  8019.5  8363.1  8589.9
on   8016.7  7999.9  8344.2  8568.9

perf bench sched messaging (100 runs):
l3    Total-time
off   0.0238
on    0.0178

perf bench sched pipe (100 runs):
l3    Total-time
off   0.3190
on    1.2688

We are very busy at the end of each month, so my tests may be insufficient;
I'm sorry for that.
According to the numbers above, I think it's worth turning off L3-cache by
default.

>> Sorry I have no numbers in hand currently :(
>> I'll do some tests these days, please give me some time.
> We'll try to get some data on this, too.
>>>> What's more, the L3 cache relates to the sched_domain, which is important
>>>> to the (load) balancer when the system is busy.
>>>> All this doesn't mean the patch is insignificant; I just think we should
>>>> do more research before deciding.  I'll do some tests, thanks. :)
>>> Yes, we need more data.  But if we find out that there are no
>>> cases where the l3-cache=on default actually improves
>>> performance, I will be willing to apply this patch.
>> That's a good thing if we find the truth, it's free. :)
>> OTOH, I think we should note that Linux is designed for real hardware, so
>> there may be other problems if QEMU lacks related features.  If we search
>> for 'cpus_share_cache' in the Linux kernel, we can see that it's also used
>> by the block layer.
>>> IMO, the long term solution is to make Linux guests not misbehave
>>> when we stop lying about the L3 cache.  Maybe we could provide a
>>> "IPIs are expensive, please avoid them" hint in the KVM CPUID
>>> leaf?
> We already have it, it's the hypervisor bit ;)  Seriously, I'm unaware
> of hypervisors where IPIs aren't expensive.
>> Maybe more PV features could be dug up.
> One problem with this is that PV features are hard to get into other
> guest OSes or existing Linux guests.

Some cloud providers (e.g. Amazon, Alibaba, ...) provide customized guests
that can include more PV features to reach peak performance.
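
For illustration only, a guest-side probe for the kind of hint Eduardo
mentioned could look like the sketch below.  KVM_FEATURE_AVOID_IPI and its
bit position are invented for this example; only leaf 0x40000001
(KVM_CPUID_FEATURES) is real:
'''
#include <stdbool.h>

#define KVM_CPUID_FEATURES    0x40000001  /* real KVM feature leaf */
#define KVM_FEATURE_AVOID_IPI (1u << 14)  /* hypothetical bit, made up here */

static bool kvm_avoid_ipi_hint(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* KVM reports its paravirt feature bits in EAX of this leaf. */
    __asm__ volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(KVM_CPUID_FEATURES), "c"(0));

    return eax & KVM_FEATURE_AVOID_IPI;
}
'''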

> Roman.

