qemu-devel

Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature


From: Laszlo Ersek
Subject: Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
Date: Tue, 17 Jan 2017 19:53:21 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.6.0

On 01/17/17 15:20, Igor Mammedov wrote:
> On Tue, 17 Jan 2017 14:46:05 +0100
> Laszlo Ersek <address@hidden> wrote:
> 
>> On 01/17/17 14:20, Igor Mammedov wrote:
>>> On Tue, 17 Jan 2017 13:06:13 +0100
>>> Laszlo Ersek <address@hidden> wrote:
>>>   
>>>> On 01/17/17 12:21, Igor Mammedov wrote:  
>>>>> On Mon, 16 Jan 2017 21:46:55 +0100
>>>>> Laszlo Ersek <address@hidden> wrote:
>>>>>     
>>>>>> On 01/13/17 14:09, Igor Mammedov wrote:    
>>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
>>>>>>> Laszlo Ersek <address@hidden> wrote:
>>>>>>>       
>>>>>>>> The generic edk2 SMM infrastructure prefers
>>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each 
>>>>>>>> processor. If
>>>>>>>> Trigger() only brings the current processor into SMM, then edk2 
>>>>>>>> handles it
>>>>>>>> in the following ways:
>>>>>>>>
>>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
>>>>>>>>     ExitBootServices(), but is not necessarily true at runtime), then:
>>>>>>>>
>>>>>>>>     (a) If edk2 has been configured for "traditional" SMM 
>>>>>>>> synchronization,
>>>>>>>>         then the BSP sends directed SMIs to the APs with APIC delivery,
>>>>>>>>         bringing them into SMM individually. Then the BSP runs the SMI
>>>>>>>>         handler / dispatcher.
>>>>>>>>
>>>>>>>>     (b) If edk2 has been configured for "relaxed" SMM synchronization,
>>>>>>>>         then the APs that are not already in SMM are not brought in, 
>>>>>>>> and
>>>>>>>>         the BSP runs the SMI handler / dispatcher.
>>>>>>>>
>>>>>>>> (2) If Trigger() is executed by an AP (which is possible after
>>>>>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
>>>>>>>>     efibootmgr"), then the AP in question brings in the BSP with a
>>>>>>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
>>>>>>>>
>>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
>>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
>>>>>>>> command from (2) can take more than 3 seconds to complete, because
>>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
>>>>>>>>
>>>>>>>> The larger problem is that QEMU's current behavior diverges from the
>>>>>>>> behavior usually seen on physical hardware, and that keeps exposing
>>>>>>>> obscure corner cases, race conditions and other instabilities in edk2,
>>>>>>>> which generally expects / prefers a software SMI to affect all CPUs at
>>>>>>>> once.
>>>>>>>>
>>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to 
>>>>>>>> inject
>>>>>>>> the SMI on all VCPUs.
>>>>>>>>
>>>>>>>> While the original posting of this patch
>>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
>>>>>>>> only intended to speed up (2), based on our recent "stress testing" of 
>>>>>>>> SMM
>>>>>>>> this patch actually provides functional improvements.
>>>>>>>>
>>>>>>>> Cc: "Michael S. Tsirkin" <address@hidden>
>>>>>>>> Cc: Gerd Hoffmann <address@hidden>
>>>>>>>> Cc: Igor Mammedov <address@hidden>
>>>>>>>> Cc: Paolo Bonzini <address@hidden>
>>>>>>>> Signed-off-by: Laszlo Ersek <address@hidden>
>>>>>>>> Reviewed-by: Michael S. Tsirkin <address@hidden>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Notes:
>>>>>>>>     v6:
>>>>>>>>     - no changes, pick up Michael's R-b
>>>>>>>>     
>>>>>>>>     v5:
>>>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>>>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>>>>>>>>       DEFINE_PROP_BIT() in the next patch)
>>>>>>>>
>>>>>>>>  include/hw/i386/ich9.h |  3 +++
>>>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>>>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>>>>>>>> index da1118727146..18dcca7ebcbf 100644
>>>>>>>> --- a/include/hw/i386/ich9.h
>>>>>>>> +++ b/include/hw/i386/ich9.h
>>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>>>>>>>>  #define ICH9_SMB_HST_D1                         0x06
>>>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
>>>>>>>>  
>>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
>>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
>>>>>>>> +
>>>>>>>>  #endif /* HW_ICH9_H */
>>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>>>>>>>> index 376b7801a42c..ced6f803a4f2 100644
>>>>>>>> --- a/hw/isa/lpc_ich9.c
>>>>>>>> +++ b/hw/isa/lpc_ich9.c
>>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, 
>>>>>>>> void *arg)
>>>>>>>>  
>>>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>>>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
>>>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>>>>>> +        if (lpc->smi_negotiated_features &
>>>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
>>>>>>>> +            CPUState *cs;
>>>>>>>> +            CPU_FOREACH(cs) {
>>>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>>>> +            }      
>>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?      
>>>>>>
>>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry code
>>>>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same
>>>>>> code area for execution. The VCPUs are told apart from each other by
>>>>>> Initial APIC ID (that is, "who am I"), which is used as an index or
>>>>>> search criterion into a global shared array. Once they find their
>>>>>> respective private data blocks, the VCPUs don't step on each other's 
>>>>>> toes.    
>>>>> That's what I'm not sure about.
>>>>> If a broadcast SMI is sent before relocation, all CPUs will use
>>>>> the same SMBASE and as a result save their state in the same
>>>>> memory (even if it's the state after RESET/POWER ON), racing with
>>>>> and overwriting each other's saved state
>>>>
>>>> Good point!
>>>>  
>>>>> and then on exit from SMI they will all restore
>>>>> whatever state that race produced, so it seems a plainly wrong thing to do.
>>>>>
>>>>> Are you sure that edk2 does a broadcast for SMI initialization, or does
>>>>> it use directed SMIs?
>>>>
>>>> Hmmm, you are right. The SmmRelocateBases() function in
>>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
>>>> each individual AP in succession, by sending SMI IPIs to them, one by
>>>> one. Then it does the same for the BSP, by sending itself a similar SMI 
>>>> IPI.
>>>>
>>>> The above QEMU code is only exercised much later, after the SMBASE
>>>> relocation, with the SMM stack up and running. It is used when a
>>>> (preferably broadcast) SMI is triggered for firmware services (for
>>>> example, the UEFI variable services).
>>>>
>>>> So, you are right.
>>>>
>>>> Considering edk2 only, the difference shouldn't matter -- when this code
>>>> is invoked (via an IO port write), the SMBASEs will all have been 
>>>> relocated.
>>>>
>>>> Also, I've been informed that on real hardware, writing to APM_CNT
>>>> triggers an SMI on all CPUs, regardless of the individual SMBASE values.
>>>> So I believe the above code should be closest to real hardware, and the
>>>> pre-patch code was only written in unicast form for SeaBIOS's sake.
>>>>
>>>> I think the code above is safe. If the guest injects an SMI via APM_CNT
>>>> after negotiating SMI broadcast, it should not have left any VCPUs
>>>> without an SMI handler by that time. edk2 / OVMF conforms to this. In
>>>> the future we can tweak this further, if necessary; we have 63 more
>>>> bits, and we'll be able to reject invalid combinations of bits.
>>>>
>>>> Do you feel that we should skip VCPUs whose SMBASE has not been changed
>>>> from the default?
>>>>
>>>> If so, doesn't that run the risk of missing a VCPU that has an actual
>>>> SMI handler installed, and it just happens to be placed at the default
>>>> SMBASE location?
>>>>
>>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs,
>>>> but I think (a) that's not what real hardware does, and (b) it is risky
>>>> if a VCPU's actual SMI handler has been installed while keeping the
>>>> default SMBASE value.
>>>>
>>>> What do you think?  
>>> (a) probably doesn't consider hotplug, so it's totally fine for the case
>>> where firmware is the only one that wakes up/initializes CPUs.
>>> However, consider 2 CPUs hotplugged, and then a broadcast SMI happening
>>> to serve some other task (if there isn't an 'unpark' feature).
>>> Even if real hw does it, it's behavior not defined by the SDM, with
>>> possibly bad consequences.
>>>
>>> (b) alone is risky, so I'd skip based on a combination of the two:
>>>
>>>  if (default SMBASE && CPU is in reset state)
>>>     skip;
>>>
>>> that should be safe, and it leaves open the possibility of avoiding
>>> adding an 'unpark' feature to the CPU.
>>
>> The above attributes ("SMBASE" and "CPU is in reset state") are specific
>> to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the
>> X86_CPU() macro?
>>
>> Can I assume here that the macro will never return NULL? (I think so,
>> LPC is only used with x86.)
> yep, that should work.
> 
>>
>> ... I guess SMBASE can be accessed as
>>
>>   X86_CPU(cs)->env.smbase
> preferred style:
>     X86CPU *cpu = X86_CPU(cs);
>     cpu->...
> 
>>
>> How do I figure out if the CPU is in reset state ("waiting for first
>> INIT") though? Note that this state should be distinguished from simply
>> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected
>> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
>> APs to sleep.
> 
> how about using EIP reset value?
> x86_cpu_reset()
>    ...
>    env->eip = 0xfff0;

Okay, so I tried this. It doesn't work.

This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
function from ich9_apm_ctrl_changed(), for size reasons):

> static void ich9_apm_broadcast_smi(void)
> {
>     CPUState *cs;
>
>     cpu_synchronize_all_states(); /* <--------- mark this */
>     CPU_FOREACH(cs) {
>         X86CPU *cpu = X86_CPU(cs);
>         CPUX86State *env = &cpu->env;
>
>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>             CPUClass *k = CPU_GET_CLASS(cs);
>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>
>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>             continue;
>         }
>
>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>     }
> }

There are two cases:

(a) If I do *not* add the cpu_synchronize_all_states() call before the
loop, then the filter condition matches *despite* the APM_CNT write
being performed by OVMF *after* SMBASE relocation. I see the trace
messages in the QEMU log (this is a four VCPU Ia32 guest):

address@hidden:ich9_apm_broadcast_smi_skip cpu_arch_id=0x0
address@hidden:ich9_apm_broadcast_smi_skip cpu_arch_id=0x1
address@hidden:ich9_apm_broadcast_smi_skip cpu_arch_id=0x2
address@hidden:ich9_apm_broadcast_smi_skip cpu_arch_id=0x3
address@hidden:ich9_apm_broadcast_smi_skip cpu_arch_id=0x0
address@hidden:ich9_apm_broadcast_smi_skip cpu_arch_id=0x1
address@hidden:ich9_apm_broadcast_smi_skip cpu_arch_id=0x2
address@hidden:ich9_apm_broadcast_smi_skip cpu_arch_id=0x3

That is, even though SMBASE relocation succeeds first, and APM_CNT is
only written afterwards -- I proved this from the OVMF debug log -- the
code above does not see an up-to-date CPU state for the KVM VCPUs, and
it filters out the SMI for *all* VCPUs.

Needless to say, this breaks the firmware entirely; it cannot boot.

(b) If I add the cpu_synchronize_all_states() call before the loop, then
the incorrect filter matches go away; QEMU sees the KVM VCPU states
correctly, and the SMI is broadcast.

However, in this case, the boot slows to a *crawl*. It's unbearable. All
VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
with the naked eye, almost line by line.

I didn't expect that cpu_synchronize_all_states() would be so costly,
but it makes the idea of checking VCPU state in the APM_CNT handler a
non-starter.

Can we stick with the current "v6 wave 2" in light of this?

Thanks,
Laszlo


