Re: [Qemu-ppc] [PATCH v2 2/4] spapr/rtas: disable the decrementer interr
From: Cédric Le Goater
Subject: Re: [Qemu-ppc] [PATCH v2 2/4] spapr/rtas: disable the decrementer interrupt when a CPU is unplugged
Date: Thu, 12 Oct 2017 11:29:35 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0

On 10/12/2017 11:25 AM, Cédric Le Goater wrote:
> On 10/12/2017 12:46 AM, David Gibson wrote:
>> On Wed, Oct 11, 2017 at 01:55:20PM +0200, Cédric Le Goater wrote:
>>> On 10/11/2017 08:45 AM, David Gibson wrote:
>>>> On Mon, Oct 09, 2017 at 05:49:28PM +0200, Cédric Le Goater wrote:
>>>>> When a CPU is stopped with the 'stop-self' RTAS call, its state
>>>>> 'halted' is switched to 1 and, in this case, the MSR is not taken into
>>>>> account anymore in the cpu_has_work() routine. Only the pending
>>>>> hardware interrupts are checked with their LPCR:PECE* enablement bit.
>>>>>
>>>>> If the DECR timer fires after 'stop-self' is called and before the CPU
>>>>> 'stop' state is reached, the nearly-dead CPU will have some work to do
>>>>> and the guest will crash. This case happens very frequently with the
>>>>> not yet upstream P9 XIVE exploitation mode. In XICS mode, the DECR is
>>>>> occasionally fired but after 'stop' state, so no work is to be done
>>>>> and the guest survives.
>>>>>
>>>>> I suspect there is a race between the QEMU mainloop triggering the
>>>>> timers and the TCG CPU thread but I could not quite identify the root
>>>>> cause. To be safe, let's disable the decrementer interrupt in the LPCR
>>>>> when the CPU is halted and reenable it when the CPU is restarted.
>>>>>
>>>>> Signed-off-by: Cédric Le Goater <address@hidden>
>>>>> ---
>>>>>
>>>>> Changes in v2:
>>>>>
>>>>> - used a new routine ppc_cpu_pvr_match() to discriminate CPU versions
>>>>> - removed the LPCR:PECE* enablement bit when the CPU is initialized
>>>>> if it is a secondary
>>>>>
>>>>> hw/ppc/spapr_rtas.c | 20 ++++++++++++++++++++
>>>>> target/ppc/translate_init.c | 19 +++++++++++++++++--
>>>>> 2 files changed, 37 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>>>>> index cdf0b607a0a0..dfdbf1e2c6f8 100644
>>>>> --- a/hw/ppc/spapr_rtas.c
>>>>> +++ b/hw/ppc/spapr_rtas.c
>>>>> @@ -46,6 +46,7 @@
>>>>> #include "qemu/cutils.h"
>>>>> #include "trace.h"
>>>>> #include "hw/ppc/fdt.h"
>>>>> +#include "target/ppc/cpu-models.h"
>>>>>
>>>>> static void rtas_display_character(PowerPCCPU *cpu, sPAPRMachineState
>>>>> *spapr,
>>>>> uint32_t token, uint32_t nargs,
>>>>> @@ -174,6 +175,15 @@ static void rtas_start_cpu(PowerPCCPU *cpu_,
>>>>> sPAPRMachineState *spapr,
>>>>> kvm_cpu_synchronize_state(cs);
>>>>>
>>>>> env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
>>>>> +
>>>>> + /* Enable DECR interrupt */
>>>>> + if (ppc_cpu_pvr_match(cpu, CPU_POWERPC_LOGICAL_3_00)) {
>>>>
>>>> Sorry, I didn't reply to your earlier mail in time. Going via the PVR
>>>> in this way seems bonkers to me - I like it even less than checking
>>>> the mmu type. After all, classifying a bunch of precise models (PVRs)
>>>> together by behaviour is kind of exactly what the CPU classes are for,
>>>> so using object_dynamic_cast() (==instance_of) is a better idea here.
>>>
>>> hmm, and which type should I use ? we don't have any TYPE_POWER9* we
>>> could use for a object_dynamic_cast(). I don't think so ? I could use
>>> the name and strcmp("power9") probably but it looks ugly.
>>
>> Actually there is, but, yeah, it's a lot less obvious than I thought.
>> It's constructed by the POWERPC_FAMILY macro and will be
>> "POWER9-family-powerpc64-cpu"
>>
>>> The only thing we have is "CPU_POWERPC_POWER9_BASE" and it only
>>> applies to the PVR.
>>>
>>> Maybe I don't understand your idea.
>>
>> Urgh, sorry. This got much muckier than I thought it would be. I
>> think maybe it's best to go back to the mmu type test, and later on we
>> can fix up both the previously existing test like that, and the new
>> one to something better.
>
> Given that the bits are the same on all processors, why not just use :

grummf, P7 reserves bits 47 and 48.

C.

> env->spr[SPR_LPCR] |= LPCR_PECE_L_MASK;
>
> and
>
> env->spr[SPR_LPCR] &= ~LPCR_PECE_L_MASK;
>
> Thanks,
>
> C.
>
>
>>>>> + env->spr[SPR_LPCR] |= LPCR_DEE;
>>>>> + } else {
>>>>> + /* P7 and P8 both have same bit for DECR */
>>>>> + env->spr[SPR_LPCR] |= LPCR_P8_PECE3;
>>>>> + }
>>>>> +
>>>>> env->nip = start;
>>>>> env->gpr[3] = r3;
>>>>> cs->halted = 0;
>>>>
>>>> The other option I'm wondering about here is to actually add a
>>>> "shutdown" (or something) method to the cpu class, which does whatever
>>>> is necessary to put the vcpu into a quiescent state that won't be
>>>> woken up unless it's specifically requested.
>>>
>>> yes. That is a good idea.
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>
>>>>> @@ -210,6 +220,16 @@ static void rtas_stop_self(PowerPCCPU *cpu,
>>>>> sPAPRMachineState *spapr,
>>>>> * no need to bother with specific bits, we just clear it.
>>>>> */
>>>>> env->msr = 0;
>>>>> +
>>>>> + /* Don't let the decrementer run on a CPU being stopped. This could
>>>>> + * deliver an interrupt on a dying CPU and crash the guest.
>>>>> + */
>>>>> + if (ppc_cpu_pvr_match(cpu, CPU_POWERPC_LOGICAL_3_00)) {
>>>>> + env->spr[SPR_LPCR] &= ~LPCR_DEE;
>>>>> + } else {
>>>>> + /* P7 and P8 both have same bit for DECR */
>>>>> + env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
>>>>> + }
>>>>> }
>>>>>
>>>>> static inline int sysparm_st(target_ulong addr, target_ulong len,
>>>>> diff --git a/target/ppc/translate_init.c b/target/ppc/translate_init.c
>>>>> index 0d6379fcc5b4..1a62159843e7 100644
>>>>> --- a/target/ppc/translate_init.c
>>>>> +++ b/target/ppc/translate_init.c
>>>>> @@ -8905,6 +8905,7 @@ void cpu_ppc_set_papr(PowerPCCPU *cpu,
>>>>> PPCVirtualHypervisor *vhyp)
>>>>> CPUPPCState *env = &cpu->env;
>>>>> ppc_spr_t *lpcr = &env->spr_cb[SPR_LPCR];
>>>>> ppc_spr_t *amor = &env->spr_cb[SPR_AMOR];
>>>>> + CPUState *cs = CPU(cpu);
>>>>>
>>>>> cpu->vhyp = vhyp;
>>>>>
>>>>> @@ -8946,8 +8947,15 @@ void cpu_ppc_set_papr(PowerPCCPU *cpu,
>>>>> PPCVirtualHypervisor *vhyp)
>>>>> } else {
>>>>> lpcr->default_value &= ~(LPCR_UPRT | LPCR_GTSE);
>>>>> }
>>>>> - lpcr->default_value |= LPCR_PDEE | LPCR_HDEE | LPCR_EEE |
>>>>> LPCR_DEE |
>>>>> + lpcr->default_value |= LPCR_PDEE | LPCR_HDEE | LPCR_EEE |
>>>>> LPCR_OEE;
>>>>
>>>> But I guess we'd also need a "set_papr" method to go with that.
>>>>
>>>>> +
>>>>> + /* Only let the decrementer wake up the boot CPU. The RTAS
>>>>> + * command start-cpu will enable it on secondaries.
>>>>> + */
>>>>> + if (cs == first_cpu) {
>>>>> + lpcr->default_value |= LPCR_DEE;
>>>>> + }
>>>>> break;
>>>>> default:
>>>>> /* P7 and P8 has slightly different PECE bits, mostly because P8
>>>>> adds
>>>>> @@ -8955,7 +8963,14 @@ void cpu_ppc_set_papr(PowerPCCPU *cpu,
>>>>> PPCVirtualHypervisor *vhyp)
>>>>> * will work as expected for both implementations
>>>>> */
>>>>> lpcr->default_value |= LPCR_P8_PECE0 | LPCR_P8_PECE1 |
>>>>> LPCR_P8_PECE2 |
>>>>> - LPCR_P8_PECE3 | LPCR_P8_PECE4;
>>>>> + LPCR_P8_PECE4;
>>>>> +
>>>>> + /* Only let the decrementer wake up the boot CPU. The RTAS
>>>>> + * command start-cpu will enable it on secondaries.
>>>>> + */
>>>>> + if (cs == first_cpu) {
>>>>> + lpcr->default_value |= LPCR_P8_PECE3;
>>>>> + }
>>>>> }
>>>>>
>>>>> /* We should be followed by a CPU reset but update the active value
>>>>
>>>
>>
>
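
For readers following the thread, here is a minimal standalone sketch of the
behaviour the patch is after: once a CPU is halted the MSR no longer matters,
and a pending interrupt only counts as work if its wake-up enable bit is set
in the LPCR. Everything named Demo*/demo_*/DEMO_* is hypothetical and only
models the idea; the bit positions are placeholders, not the real LPCR or MSR
layout, and none of this is QEMU code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits for illustration only; NOT the real LPCR/MSR layout. */
#define DEMO_MSR_EE          (UINT64_C(1) << 0)  /* external interrupts enabled */
#define DEMO_LPCR_WAKE_EXT   (UINT64_C(1) << 0)  /* wake on external interrupt */
#define DEMO_LPCR_WAKE_DECR  (UINT64_C(1) << 1)  /* wake on decrementer */

#define DEMO_IRQ_EXT   (1u << 0)
#define DEMO_IRQ_DECR  (1u << 1)

/* Hypothetical, heavily reduced stand-in for a vCPU. */
typedef struct DemoCPU {
    bool halted;
    uint64_t msr;
    uint64_t lpcr;
    uint32_t pending;   /* pending hardware interrupts */
} DemoCPU;

/* Halted: ignore the MSR, look only at pending interrupts whose wake-up
 * enable bit is set. Running: simplified to "anything pending and EE set". */
static bool demo_cpu_has_work(const DemoCPU *cpu)
{
    if (cpu->halted) {
        if ((cpu->pending & DEMO_IRQ_EXT) &&
            (cpu->lpcr & DEMO_LPCR_WAKE_EXT)) {
            return true;
        }
        if ((cpu->pending & DEMO_IRQ_DECR) &&
            (cpu->lpcr & DEMO_LPCR_WAKE_DECR)) {
            return true;
        }
        return false;
    }
    return cpu->pending != 0 && (cpu->msr & DEMO_MSR_EE) != 0;
}

/* Model of 'stop-self': optionally mask the decrementer wake-up first. */
static void demo_stop_self(DemoCPU *cpu, bool mask_decr)
{
    cpu->msr = 0;
    if (mask_decr) {
        cpu->lpcr &= ~DEMO_LPCR_WAKE_DECR;
    }
    cpu->halted = true;
}

/* Model of 'start-cpu': re-enable the decrementer wake-up and resume. */
static void demo_start_cpu(DemoCPU *cpu)
{
    cpu->msr = DEMO_MSR_EE;
    cpu->lpcr |= DEMO_LPCR_WAKE_DECR;
    cpu->halted = false;
}

/* Scenario: the decrementer fires after the CPU has stopped itself.
 * Returns whether the stopped CPU would be considered to have work. */
static bool demo_decr_wakes_stopped_cpu(bool mask_decr)
{
    DemoCPU cpu = {
        .halted = false,
        .msr = DEMO_MSR_EE,
        .lpcr = DEMO_LPCR_WAKE_EXT | DEMO_LPCR_WAKE_DECR,
        .pending = 0,
    };

    demo_stop_self(&cpu, mask_decr);
    cpu.pending |= DEMO_IRQ_DECR;   /* late timer expiry */
    return demo_cpu_has_work(&cpu);
}
```

Calling demo_decr_wakes_stopped_cpu(false) models the situation described in
the commit message (the stopped CPU still sees work), while passing true
models the proposed masking. The sketch deliberately uses a single wake-up
bit, so it says nothing about the open question in this thread of how to
pick the right bit per CPU family.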