[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC PATCH 1/2] spapr: Report correct GTSE support via ov5
From: |
Fabiano Rosas |
Subject: |
Re: [RFC PATCH 1/2] spapr: Report correct GTSE support via ov5 |
Date: |
Mon, 14 Mar 2022 19:10:10 -0300 |
David Gibson <david@gibson.dropbear.id.au> writes:
> On Tue, Mar 08, 2022 at 10:23:59PM -0300, Fabiano Rosas wrote:
>> QEMU reports MMU support to the guest via the ibm,architecture-vec-5
>> property of the /chosen node. Byte number 26 specifies Radix Table
>> Expansions, currently only GTSE (Guest Translation Shootdown
>> Enable). This feature determines whether the tlbie instruction (and
>> others) are HV privileged.
>>
>> Up until now, we always reported GTSE=1 to guests. Even after the
>> support for GTSE=0 was added. As part of that support, a kernel
>> command line radix_hcall_invalidate=on was introduced that overrides
>> the GTSE value received via CAS. So a guest can run with GTSE=0 and
>> use the H_RPT_INVALIDATE hcall instead of tlbie.
>>
>> In this scenario, having GTSE always set to 1 by QEMU leads to a crash
>> when running nested KVM guests because KVM does not allow a nested
>> hypervisor to set GTSE support for its nested guests. So a nested
>> guest always uses the same value for LPCR_GTSE as its HV. Since the
>> nested HV disabled GTSE, but the L2 QEMU always reports GTSE=1, we run
>> into a crash when:
>>
>> L1 LPCR_GTSE=0
>> L2 LPCR_GTSE=0
>> L2 CAS GTSE=1
>>
>> The nested guest will run 'tlbie' and crash because the HW looks at
>> LPCR_GTSE, which is clear.
>>
>> Having GTSE disabled in the L1 and enabled in the L2 is not an option
>> because the whole purpose of GTSE is to disallow access to tlbie and
>> we cannot allow L1 to spawn L2s that can access features that L1
>> itself cannot.
>>
>> We also cannot have the guest check the LPCR bit, because LPCR is
>> HV-privileged.
>>
>> So this patch goes through the most intuitive route which is to have
>> QEMU ask KVM about GTSE support and advertise the correct value to the
>> guest. A new KVM_CAP_PPC_GTSE capability is being added.
>>
>> TCG continues to always enable GTSE.
>>
>> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
>> ---
>> hw/ppc/spapr.c | 38 +++++++++++++++++++++++++++++++-------
>> target/ppc/kvm.c | 8 ++++++++
>> target/ppc/kvm_ppc.h | 6 ++++++
>> 3 files changed, 45 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 4cc204f90d..3e95a1831f 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -971,7 +971,7 @@ static void
>> spapr_dt_ov5_platform_support(SpaprMachineState *spapr, void *fdt,
>> 23, 0x00, /* XICS / XIVE mode */
>> 24, 0x00, /* Hash/Radix, filled in below. */
>> 25, 0x00, /* Hash options: Segment Tables == no, GTSE == no. */
>> - 26, 0x40, /* Radix options: GTSE == yes. */
>> + 26, 0x00, /* Radix options, filled in below. */
>> };
>>
>> if (spapr->irq->xics && spapr->irq->xive) {
>> @@ -1000,10 +1000,16 @@ static void
>> spapr_dt_ov5_platform_support(SpaprMachineState *spapr, void *fdt,
>> } else {
>> val[3] = 0x00; /* Hash */
>> }
>> +
>> + if (kvmppc_has_cap_gtse()) {
>> + val[7] = 0x40 /* OV5_MMU_RADIX_GTSE */;
>> + }
>> } else {
>> /* V3 MMU supports both hash and radix in tcg (with dynamic
>> switching) */
>> val[3] = 0xC0;
>> + val[7] = 0x40 /* OV5_MMU_RADIX_GTSE */;
>> }
>> +
>> _FDT(fdt_setprop(fdt, chosen, "ibm,arch-vec-5-platform-support",
>> val, sizeof(val)));
>> }
>> @@ -2824,14 +2830,32 @@ static void spapr_machine_init(MachineState *machine)
>> /* Init numa_assoc_array */
>> spapr_numa_associativity_init(spapr, machine);
>>
>> - if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
>> - ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00,
>> 0,
>> + if (ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00,
>> 0,
>> spapr->max_compat_pvr)) {
>> - spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_300);
>> - /* KVM and TCG always allow GTSE with radix... */
>> - spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_GTSE);
>> +
>> + /* TCG always supports Radix w/ GTSE */
>> + if (!kvm_enabled()) {
>> + spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_300);
>> + spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_GTSE);
>
> Yeah, this is no good. It's never ok to change guest visible
> behaviour depending on host properties (in this case whether it's KVM
> or not). It messes up the invariants we need for migration, which
> require that the guest visible state depend only on the user
> configuration.
>
> The usual way to handle this is with a new capability, you can then
> change the default with the next machine version.
This particular problem is tricky. TCG cannot disable GTSE because it
does not support H_RPT_INVALIDATE. And older kernels that don't know
about the feature require GTSE.
KVM can afford to disable GTSE because we have a compatibility mechanism
(although a bit crooked): We can invert the logic for the KVM_CAP so
that the presence of KVM_CAP_PPC_GTSE_DISABLE would mean QEMU is allowed
to disable GTSE. Then:
- older KVM + new QEMU would keep GTSE enabled;
- older L1 guests are not affected because the host would report
GTSE=1 with the KVM capability. By the time we decide to disable
GTSE for L1 guests hopefully all older kernels will be out of use;
- older nested guests:
- if L1 runs with GTSE=1, are not affected;
- if L1 disabled GTSE via kernel cmdline, are already broken (this
bug). But they would go from crashing to being aborted* by QEMU
(the guest asks for HPT in the lack of GTSE; nested KVM is radix
only);
* there are other broken cases which are fixed completely.
To satisfy TCG we could keep a spapr capability as ON and usually the
guest would pass cap-gtse=off when running with KVM. However this
doesn't work because this crash happens precisely because the nested
guest doesn't know that it needs to use cap-rpt-invalidate=on. Another
cap wouldn't help.
So I think the only way to have a spapr capability for this is if TCG
always defaults to ON and KVM always defaults to OFF. But then we would
be changing guest visible behaviour depending on host properties.