[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] spapr: Rework hash<->radix transitions at CAS
From: |
David Gibson |
Subject: |
Re: [PATCH] spapr: Rework hash<->radix transitions at CAS |
Date: |
Tue, 25 Feb 2020 10:44:28 +1100 |
On Mon, Feb 24, 2020 at 12:18:27PM +0100, Greg Kurz wrote:
> On Wed, 19 Feb 2020 10:21:05 +1100
> David Gibson <address@hidden> wrote:
>
> > On Fri, Feb 14, 2020 at 07:19:00PM +0100, Greg Kurz wrote:
> > > On Fri, 14 Feb 2020 09:28:35 +1100
> > > David Gibson <address@hidden> wrote:
> > >
> > > > On Thu, Feb 13, 2020 at 04:38:38PM +0100, Greg Kurz wrote:
> > > > > Until the CAS negotiation is over, an HPT can be allocated on three
> > > > > different paths:
> > > > >
> > > > > 1) during machine reset if the host doesn't support radix,
> > > > >
> > > > > 2) during CAS if the guest wants hash and doesn't support HPT
> > > > > resizing,
> > > > > in which case we pre-emptively resize the HPT to accomodate maxram,
> > > > >
> > > > > 3) during CAS if no CAS reboot was requested, the guest wants hash but
> > > > > we're currently configured for radix.
> > > > >
> > > > > Depending on the various combinations of host or guest MMU support,
> > > > > HPT resizing guest support and the possibility of a CAS reboot, it
> > > > > is quite hard to know which of these allocates the HPT that will
> > > > > be ultimately used by the guest that wants to do hash. Also, some of
> > > > > them have bugs:
> > > > >
> > > > > - 2) calls spapr_reallocate_hpt() instead of
> > > > > spapr_setup_hpt_and_vrma()
> > > > > and thus doesn't update the VRMA size, even though we've just
> > > > > extended
> > > > > the HPT. Not sure what issues this can cause,
> > > > >
> > > > > - 3) doesn't check for HPT resizing support and will always allocate a
> > > > > small HPT based on the initial RAM size. This caps the total amount
> > > > > of
> > > > > RAM the guest can see, especially if maxram is much higher than the
> > > > > initial ram.
> > > > >
> > > > > We only support guests that do CAS and we already assume that the HPT
> > > > > isn't being used when we do the pre-emptive resizing at CAS. It thus
> > > > > seems reasonable to only allocate the HPT at the end of CAS, when no
> > > > > CAS reboot was requested.
> > > > >
> > > > > Consolidate the logic so that we only create the HPT during 3), ie.
> > > > > when we're done with the CAS reboot cycles, and ensure HPT resizing
> > > > > is taken into account. This fixes the radix->hash transition for
> > > > > all cases.
> > > >
> > > > Uh.. I'm pretty sure this can't work for KVM on a POWER8 host. We
> > > > need the HPT at all times there, or there's nowhere to put VRMA
> > > > entries, so we can't run even in real mode.
> > > >
> > >
> > > Well it happens to be working anyway because KVM automatically
> > > creates an HPT (default size 16MB) in kvmppc_hv_setup_htab_rma()
> > > if QEMU didn't do so already... Would a comment to emphasize this
> > > be enough or do you prefer I don't drop the HPT allocation currently
> > > performed at machine reset ?
> >
> > Relying on the automatic allocation is not a good idea. With host
> > kernels before HPT resizing, once that automatic allocation happens,
> > we can't change the HPT size *at all*, even with a reset or CAS.
> >
>
> Ah ok I see. With these older host kernels, we need QEMU to allocate the
> HPT to fit ms->maxram_size, which KVM doesn't know about, or we'll have
> troubles with VMs that would need a bigger HPT.
Exactly.
> And I guess we want to
> support bigger VMs with pre-4.11 host kernels.
Well, with the usual sizing rules, a 16MiB HPT only supports a 2GiB
guest, so it doesn't have to be a particularly large VM to trip this.
>
> > So, yes, the current code is annoyingly complex, but it's that way for
> > a reason.
> >
>
> My motivation here is to get rid of CAS reboot... it definitely needs more
> thinking on my side.
Yeah, I think so.
>
> > > > > The guest can theoretically call CAS several times, without a CAS
> > > > > reboot in between. Linux guests don't do that, but better safe than
> > > > > sorry, let's ensure we can also handle the symmetrical hash->radix
> > > > > transition correctly: free the HPT and set the GR bit in PATE.
> > > > > An helper is introduced for the latter since this is already what
> > > > > we do during machine reset when going for radix.
> > > > >
> > > > > As a bonus, this removes one user of spapr->cas_reboot, which we
> > > > > want to get rid of in the future.
> > > > >
> > > > > Signed-off-by: Greg Kurz <address@hidden>
> > > > > ---
> > > > > hw/ppc/spapr.c | 25 +++++++++++++++-----
> > > > > hw/ppc/spapr_hcall.c | 59
> > > > > ++++++++++++++++++++----------------------------
> > > > > include/hw/ppc/spapr.h | 1 +
> > > > > 3 files changed, 44 insertions(+), 41 deletions(-)
> > > > >
> > > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > > > index 828e2cc1359a..88bc0e4e3ca1 100644
> > > > > --- a/hw/ppc/spapr.c
> > > > > +++ b/hw/ppc/spapr.c
> > > > > @@ -1573,9 +1573,19 @@ void
> > > > > spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
> > > > > {
> > > > > int hpt_shift;
> > > > >
> > > > > + /*
> > > > > + * HPT resizing is a bit of a special case, because when enabled
> > > > > + * we assume an HPT guest will support it until it says it
> > > > > + * doesn't, instead of assuming it won't support it until it says
> > > > > + * it does. Strictly speaking that approach could break for
> > > > > + * guests which don't make a CAS call, but those are so old we
> > > > > + * don't care about them. Without that assumption we'd have to
> > > > > + * make at least a temporary allocation of an HPT sized for max
> > > > > + * memory, which could be impossibly difficult under KVM HV if
> > > > > + * maxram is large.
> > > > > + */
> > > > > if ((spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED)
> > > > > - || (spapr->cas_reboot
> > > > > - && !spapr_ovec_test(spapr->ov5_cas, OV5_HPT_RESIZE))) {
> > > > > + || !spapr_ovec_test(spapr->ov5_cas, OV5_HPT_RESIZE)) {
> > > > > hpt_shift =
> > > > > spapr_hpt_shift_for_ramsize(MACHINE(spapr)->maxram_size);
> > > > > } else {
> > > > > uint64_t current_ram_size;
> > > > > @@ -1604,6 +1614,12 @@ static int spapr_reset_drcs(Object *child,
> > > > > void *opaque)
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +void spapr_reset_patb_entry(SpaprMachineState *spapr)
> > > > > +{
> > > > > + spapr->patb_entry = PATE1_GR;
> > > > > + spapr_set_all_lpcrs(LPCR_HR | LPCR_UPRT, LPCR_HR | LPCR_UPRT);
> > > > > +}
> > > > > +
> > > > > static void spapr_machine_reset(MachineState *machine)
> > > > > {
> > > > > SpaprMachineState *spapr = SPAPR_MACHINE(machine);
> > > > > @@ -1624,10 +1640,7 @@ static void spapr_machine_reset(MachineState
> > > > > *machine)
> > > > > * without a HPT because KVM will start them in radix mode.
> > > > > * Set the GR bit in PATE so that we know there is no HPT.
> > > > > */
> > > > > - spapr->patb_entry = PATE1_GR;
> > > > > - spapr_set_all_lpcrs(LPCR_HR | LPCR_UPRT, LPCR_HR |
> > > > > LPCR_UPRT);
> > > > > - } else {
> > > > > - spapr_setup_hpt_and_vrma(spapr);
> > > > > + spapr_reset_patb_entry(spapr);
> > > > > }
> > > > >
> > > > > qemu_devices_reset();
> > > > > diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> > > > > index b8bb66b5c0d4..57ddf0fa6d05 100644
> > > > > --- a/hw/ppc/spapr_hcall.c
> > > > > +++ b/hw/ppc/spapr_hcall.c
> > > > > @@ -1677,6 +1677,7 @@ static target_ulong
> > > > > h_client_architecture_support(PowerPCCPU *cpu,
> > > > > bool raw_mode_supported = false;
> > > > > bool guest_xive;
> > > > > CPUState *cs;
> > > > > + int maxshift =
> > > > > spapr_hpt_shift_for_ramsize(MACHINE(spapr)->maxram_size);
> > > > >
> > > > > /* CAS is supposed to be called early when only the boot vCPU is
> > > > > active. */
> > > > > CPU_FOREACH(cs) {
> > > > > @@ -1739,36 +1740,6 @@ static target_ulong
> > > > > h_client_architecture_support(PowerPCCPU *cpu,
> > > > >
> > > > > guest_xive = spapr_ovec_test(ov5_guest, OV5_XIVE_EXPLOIT);
> > > > >
> > > > > - /*
> > > > > - * HPT resizing is a bit of a special case, because when enabled
> > > > > - * we assume an HPT guest will support it until it says it
> > > > > - * doesn't, instead of assuming it won't support it until it says
> > > > > - * it does. Strictly speaking that approach could break for
> > > > > - * guests which don't make a CAS call, but those are so old we
> > > > > - * don't care about them. Without that assumption we'd have to
> > > > > - * make at least a temporary allocation of an HPT sized for max
> > > > > - * memory, which could be impossibly difficult under KVM HV if
> > > > > - * maxram is large.
> > > > > - */
> > > > > - if (!guest_radix && !spapr_ovec_test(ov5_guest, OV5_HPT_RESIZE))
> > > > > {
> > > > > - int maxshift =
> > > > > spapr_hpt_shift_for_ramsize(MACHINE(spapr)->maxram_size);
> > > > > -
> > > > > - if (spapr->resize_hpt == SPAPR_RESIZE_HPT_REQUIRED) {
> > > > > - error_report(
> > > > > - "h_client_architecture_support: Guest doesn't
> > > > > support HPT resizing, but resize-hpt=required");
> > > > > - exit(1);
> > > > > - }
> > > > > -
> > > > > - if (spapr->htab_shift < maxshift) {
> > > > > - /* Guest doesn't know about HPT resizing, so we
> > > > > - * pre-emptively resize for the maximum permitted RAM.
> > > > > At
> > > > > - * the point this is called, nothing should have been
> > > > > - * entered into the existing HPT */
> > > > > - spapr_reallocate_hpt(spapr, maxshift, &error_fatal);
> > > > > - push_sregs_to_kvm_pr(spapr);
> > > > > - }
> > > > > - }
> > > > > -
> > > > > /* NOTE: there are actually a number of ov5 bits where input
> > > > > from the
> > > > > * guest is always zero, and the platform/QEMU enables them
> > > > > independently
> > > > > * of guest input. To model these properly we'd want some sort
> > > > > of mask,
> > > > > @@ -1806,6 +1777,12 @@ static target_ulong
> > > > > h_client_architecture_support(PowerPCCPU *cpu,
> > > > > error_report("Guest requested unavailable MMU mode
> > > > > (hash).");
> > > > > exit(EXIT_FAILURE);
> > > > > }
> > > > > + if (spapr->resize_hpt == SPAPR_RESIZE_HPT_REQUIRED &&
> > > > > + !spapr_ovec_test(ov5_guest, OV5_HPT_RESIZE)) {
> > > > > + error_report(
> > > > > + "h_client_architecture_support: Guest doesn't
> > > > > support HPT resizing, but resize-hpt=required");
> > > > > + exit(1);
> > > > > + }
> > > > > }
> > > > > spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest,
> > > > > OV1_PPC_3_00);
> > > > > spapr_ovec_cleanup(ov1_guest);
> > > > > @@ -1838,11 +1815,23 @@ static target_ulong
> > > > > h_client_architecture_support(PowerPCCPU *cpu,
> > > > > void *fdt;
> > > > > SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
> > > > >
> > > > > - /* If spapr_machine_reset() did not set up a HPT but one is
> > > > > necessary
> > > > > - * (because the guest isn't going to use radix) then set it
> > > > > up here. */
> > > > > - if ((spapr->patb_entry & PATE1_GR) && !guest_radix) {
> > > > > - /* legacy hash or new hash: */
> > > > > - spapr_setup_hpt_and_vrma(spapr);
> > > > > + if (!guest_radix) {
> > > > > + /*
> > > > > + * Either spapr_machine_reset() did not set up a HPT but
> > > > > one
> > > > > + * is necessary (because the guest isn't going to use
> > > > > radix),
> > > > > + * or the guest doesn't know about HPT resizing and we
> > > > > need to
> > > > > + * pre-emptively resize for the maximum permitted RAM.
> > > > > Set it
> > > > > + * up here. At the point this is called, nothing should
> > > > > have
> > > > > + * been entered into the existing HPT.
> > > > > + */
> > > > > + if (spapr->patb_entry & PATE1_GR || spapr->htab_shift <
> > > > > maxshift) {
> > > > > + /* legacy hash or new hash: */
> > > > > + spapr_setup_hpt_and_vrma(spapr);
> > > > > + push_sregs_to_kvm_pr(spapr);
> > > > > + }
> > > > > + } else {
> > > > > + spapr_free_hpt(spapr);
> > > > > + spapr_reset_patb_entry(spapr);
> > > > > }
> > > > >
> > > > > if (fdt_bufsize < sizeof(hdr)) {
> > > > > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > > > > index 09110961a589..9d88b5596481 100644
> > > > > --- a/include/hw/ppc/spapr.h
> > > > > +++ b/include/hw/ppc/spapr.h
> > > > > @@ -919,4 +919,5 @@ void spapr_check_pagesize(SpaprMachineState
> > > > > *spapr, hwaddr pagesize,
> > > > >
> > > > > void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
> > > > > hwaddr spapr_get_rtas_addr(void);
> > > > > +void spapr_reset_patb_entry(SpaprMachineState *spapr);
> > > > > #endif /* HW_SPAPR_H */
> > > > >
> > > >
> > >
> >
> >
> >
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature