[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PULL 10/56] x86: don't let decompressed kernel image clobber setup_
From: |
H. Peter Anvin |
Subject: |
Re: [PULL 10/56] x86: don't let decompressed kernel image clobber setup_data |
Date: |
Tue, 31 Jan 2023 12:54:29 -0800 |
User-agent: |
K-9 Mail for Android |
On January 30, 2023 12:19:14 PM PST, "Michael S. Tsirkin" <mst@redhat.com>
wrote:
>From: "Jason A. Donenfeld" <Jason@zx2c4.com>
>
>The setup_data links are appended to the compressed kernel image. Since
>the kernel image is typically loaded at 0x100000, setup_data lives at
>`0x100000 + compressed_size`, which does not get relocated during the
>kernel's boot process.
>
>The kernel typically decompresses the image starting at address
>0x1000000 (note: there's one more zero there than the compressed image
>above). This usually is fine for most kernels.
>
>However, if the compressed image is actually quite large, then
>setup_data will live at a `0x100000 + compressed_size` that extends into
>the decompressed zone at 0x1000000. In other words, if compressed_size
>is larger than `0x1000000 - 0x100000`, then the decompression step will
>clobber setup_data, resulting in crashes.
>
>Visually, what happens now is that QEMU appends setup_data to the kernel
>image:
>
> kernel image setup_data
> |--------------------------||----------------|
>0x100000 0x100000+l1 0x100000+l1+l2
>
>The problem is that this decompresses to 0x1000000 (one more zero). So
>if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
> kernel image setup_data
> |--------------------------||----------------|
>0x100000 0x100000+l1 0x100000+l1+l2
>
> d e c o m p r e s s e d k e r n e l
>
> |-------------------------------------------------------------|
> 0x1000000
> 0x1000000+l3
>
>The decompressed kernel seemingly overwriting the compressed kernel
>image isn't a problem, because that gets relocated to a higher address
>early on in the boot process, at the end of startup_64. setup_data,
>however, stays in the same place, since those links are self referential
>and nothing fixes them up. So the decompressed kernel clobbers it.
>
>Fix this by appending setup_data to the cmdline blob rather than the
>kernel image blob, which remains at a lower address that won't get
>clobbered.
>
>This could have been done by overwriting the initrd blob instead, but
>that poses big difficulties, such as no longer being able to use memory
>mapped files for initrd, hurting performance, and, more importantly, the
>initrd address calculation is hard coded in qboot, and it always grows
>down rather than up, which means lots of brittle semantics would have to
>be changed around, incurring more complexity. In contrast, using cmdline
>is simple and doesn't interfere with anything.
>
>The microvm machine has a gross hack where it fiddles with fw_cfg data
>after the fact. So this hack is updated to account for this appending,
>by reserving some bytes.
>
>Fixup-by: Michael S. Tsirkin <mst@redhat.com>
>Cc: x86@kernel.org
>Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
>Cc: H. Peter Anvin <hpa@zytor.com>
>Cc: Borislav Petkov <bp@alien8.de>
>Cc: Eric Biggers <ebiggers@kernel.org>
>Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>Message-Id: <20221230220725.618763-1-Jason@zx2c4.com>
>Message-ID: <20230128061015-mutt-send-email-mst@kernel.org>
>Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>Tested-by: Eric Biggers <ebiggers@google.com>
>Tested-by: Mathias Krause <minipli@grsecurity.net>
>---
> include/hw/i386/microvm.h | 5 ++--
> include/hw/nvram/fw_cfg.h | 9 +++++++
> hw/i386/microvm.c | 15 +++++++----
> hw/i386/x86.c | 52 +++++++++++++++++++++------------------
> hw/nvram/fw_cfg.c | 9 +++++++
> 5 files changed, 59 insertions(+), 31 deletions(-)
>
>diff --git a/include/hw/i386/microvm.h b/include/hw/i386/microvm.h
>index fad97a891d..e8af61f194 100644
>--- a/include/hw/i386/microvm.h
>+++ b/include/hw/i386/microvm.h
>@@ -50,8 +50,9 @@
> */
>
> /* Platform virtio definitions */
>-#define VIRTIO_MMIO_BASE 0xfeb00000
>-#define VIRTIO_CMDLINE_MAXLEN 64
>+#define VIRTIO_MMIO_BASE 0xfeb00000
>+#define VIRTIO_CMDLINE_MAXLEN 64
>+#define VIRTIO_CMDLINE_TOTAL_MAX_LEN ((VIRTIO_CMDLINE_MAXLEN + 1) * 16)
>
> #define GED_MMIO_BASE 0xfea00000
> #define GED_MMIO_BASE_MEMHP (GED_MMIO_BASE + 0x100)
>diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
>index 2e503904dc..990dcdbb2e 100644
>--- a/include/hw/nvram/fw_cfg.h
>+++ b/include/hw/nvram/fw_cfg.h
>@@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t
>key,
> void *data, size_t len,
> bool read_only);
>
>+/**
>+ * fw_cfg_read_bytes_ptr:
>+ * @s: fw_cfg device being modified
>+ * @key: selector key value for new fw_cfg item
>+ *
>+ * Reads an existing fw_cfg data pointer.
>+ */
>+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key);
>+
> /**
> * fw_cfg_add_string:
> * @s: fw_cfg device being modified
>diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
>index 170a331e3f..29f30dd6d3 100644
>--- a/hw/i386/microvm.c
>+++ b/hw/i386/microvm.c
>@@ -378,7 +378,8 @@ static void microvm_fix_kernel_cmdline(MachineState
>*machine)
> MicrovmMachineState *mms = MICROVM_MACHINE(machine);
> BusState *bus;
> BusChild *kid;
>- char *cmdline;
>+ char *cmdline, *existing_cmdline;
>+ size_t len;
>
> /*
> * Find MMIO transports with attached devices, and add them to the kernel
>@@ -387,7 +388,8 @@ static void microvm_fix_kernel_cmdline(MachineState
>*machine)
> * Yes, this is a hack, but one that heavily improves the UX without
> * introducing any significant issues.
> */
>- cmdline = g_strdup(machine->kernel_cmdline);
>+ existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg,
>FW_CFG_CMDLINE_DATA);
>+ cmdline = g_strdup(existing_cmdline);
> bus = sysbus_get_default();
> QTAILQ_FOREACH(kid, &bus->children, sibling) {
> DeviceState *dev = kid->child;
>@@ -411,9 +413,12 @@ static void microvm_fix_kernel_cmdline(MachineState
>*machine)
> }
> }
>
>- fw_cfg_modify_i32(x86ms->fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(cmdline) +
>1);
>- fw_cfg_modify_string(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA, cmdline);
>-
>+ len = strlen(cmdline);
>+ if (len > VIRTIO_CMDLINE_TOTAL_MAX_LEN + strlen(existing_cmdline)) {
>+ fprintf(stderr, "qemu: virtio mmio cmdline too large, skipping\n");
>+ } else {
>+ memcpy(existing_cmdline, cmdline, len + 1);
>+ }
> g_free(cmdline);
> }
>
>diff --git a/hw/i386/x86.c b/hw/i386/x86.c
>index 78cc131926..eaff4227bd 100644
>--- a/hw/i386/x86.c
>+++ b/hw/i386/x86.c
>@@ -50,6 +50,7 @@
> #include "hw/intc/i8259.h"
> #include "hw/rtc/mc146818rtc.h"
> #include "target/i386/sev.h"
>+#include "hw/i386/microvm.h"
>
> #include "hw/acpi/cpu_hotplug.h"
> #include "hw/irq.h"
>@@ -813,12 +814,18 @@ void x86_load_linux(X86MachineState *x86ms,
> const char *kernel_filename = machine->kernel_filename;
> const char *initrd_filename = machine->initrd_filename;
> const char *dtb_filename = machine->dtb;
>- const char *kernel_cmdline = machine->kernel_cmdline;
>+ char *kernel_cmdline;
> SevKernelLoaderContext sev_load_ctx = {};
> enum { RNG_SEED_LENGTH = 32 };
>
>- /* Align to 16 bytes as a paranoia measure */
>- cmdline_size = (strlen(kernel_cmdline) + 16) & ~15;
>+ /*
>+ * Add the NUL terminator, some padding for the microvm cmdline fiddling
>+ * hack, and then align to 16 bytes as a paranoia measure
>+ */
>+ cmdline_size = (strlen(machine->kernel_cmdline) + 1 +
>+ VIRTIO_CMDLINE_TOTAL_MAX_LEN + 16) & ~15;
>+ /* Make a copy, since we might append arbitrary bytes to it later. */
>+ kernel_cmdline = g_strndup(machine->kernel_cmdline, cmdline_size);
>
> /* load the kernel header */
> f = fopen(kernel_filename, "rb");
>@@ -959,12 +966,6 @@ void x86_load_linux(X86MachineState *x86ms,
> initrd_max = x86ms->below_4g_mem_size - acpi_data_size - 1;
> }
>
>- fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
>- fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(kernel_cmdline) + 1);
>- fw_cfg_add_string(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline);
>- sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
>- sev_load_ctx.cmdline_size = strlen(kernel_cmdline) + 1;
>-
> if (protocol >= 0x202) {
> stl_p(header + 0x228, cmdline_addr);
> } else {
>@@ -1091,27 +1092,24 @@ void x86_load_linux(X86MachineState *x86ms,
> exit(1);
> }
>
>- setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
>- kernel_size = setup_data_offset + sizeof(SetupData) + dtb_size;
>- kernel = g_realloc(kernel, kernel_size);
>-
>-
>- setup_data = (SetupData *)(kernel + setup_data_offset);
>+ setup_data_offset = cmdline_size;
>+ cmdline_size += sizeof(SetupData) + dtb_size;
>+ kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
>+ setup_data = (void *)kernel_cmdline + setup_data_offset;
> setup_data->next = cpu_to_le64(first_setup_data);
>- first_setup_data = prot_addr + setup_data_offset;
>+ first_setup_data = cmdline_addr + setup_data_offset;
> setup_data->type = cpu_to_le32(SETUP_DTB);
> setup_data->len = cpu_to_le32(dtb_size);
>-
> load_image_size(dtb_filename, setup_data->data, dtb_size);
> }
>
>- if (!legacy_no_rng_seed) {
>- setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
>- kernel_size = setup_data_offset + sizeof(SetupData) + RNG_SEED_LENGTH;
>- kernel = g_realloc(kernel, kernel_size);
>- setup_data = (SetupData *)(kernel + setup_data_offset);
>+ if (!legacy_no_rng_seed && protocol >= 0x209) {
>+ setup_data_offset = cmdline_size;
>+ cmdline_size += sizeof(SetupData) + RNG_SEED_LENGTH;
>+ kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
>+ setup_data = (void *)kernel_cmdline + setup_data_offset;
> setup_data->next = cpu_to_le64(first_setup_data);
>- first_setup_data = prot_addr + setup_data_offset;
>+ first_setup_data = cmdline_addr + setup_data_offset;
> setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
> setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
> qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
>@@ -1122,6 +1120,12 @@ void x86_load_linux(X86MachineState *x86ms,
> fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
> }
>
>+ fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
>+ fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, cmdline_size);
>+ fw_cfg_add_bytes(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline,
>cmdline_size);
>+ sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
>+ sev_load_ctx.cmdline_size = cmdline_size;
>+
> fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
> fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
> sev_load_ctx.kernel_data = (char *)kernel;
>@@ -1134,7 +1138,7 @@ void x86_load_linux(X86MachineState *x86ms,
> * kernel on the other side of the fw_cfg interface matches the hash of
> the
> * file the user passed in.
> */
>- if (!sev_enabled()) {
>+ if (!sev_enabled() && first_setup_data) {
> SetupDataFixup *fixup = g_malloc(sizeof(*fixup));
>
> memcpy(setup, header, MIN(sizeof(header), setup_size));
>diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
>index a00881bc64..432754eda4 100644
>--- a/hw/nvram/fw_cfg.c
>+++ b/hw/nvram/fw_cfg.c
>@@ -741,6 +741,15 @@ void fw_cfg_add_bytes(FWCfgState *s, uint16_t key, void
>*data, size_t len)
> fw_cfg_add_bytes_callback(s, key, NULL, NULL, NULL, data, len, true);
> }
>
>+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key)
>+{
>+ int arch = !!(key & FW_CFG_ARCH_LOCAL);
>+
>+ key &= FW_CFG_ENTRY_MASK;
>+ assert(key < fw_cfg_max_entry(s));
>+ return s->entries[arch][key].data;
>+}
>+
> void fw_cfg_add_string(FWCfgState *s, uint16_t key, const char *value)
> {
> size_t sz = strlen(value) + 1;
Saying they are "appended to" is wrong; the loader is free to put them anywhere
in usable RAM that is not covered by the kernel image, the kernel keepout area,
the command line or initrd.
- [PULL 03/56] hw/isa/isa-bus: Turn isa_build_aml() into qbus_build_aml(), (continued)
- [PULL 03/56] hw/isa/isa-bus: Turn isa_build_aml() into qbus_build_aml(), Michael S. Tsirkin, 2023/01/30
- [PULL 05/56] hw/acpi/acpi_dev_interface: Remove unused parameter from AcpiDeviceIfClass::madt_cpu, Michael S. Tsirkin, 2023/01/30
- [PULL 04/56] hw/acpi/piix4: No need to #include "hw/southbridge/piix.h", Michael S. Tsirkin, 2023/01/30
- [PULL 06/56] vhost-user: Correct a reference of TARGET_AARCH64, Michael S. Tsirkin, 2023/01/30
- [PULL 07/56] hw/pci-host: Use register definitions from PCI standard, Michael S. Tsirkin, 2023/01/30
- [PULL 08/56] virtio-rng-pci: fix migration compat for vectors, Michael S. Tsirkin, 2023/01/30
- [PULL 09/56] intel-iommu: Document iova_tree, Michael S. Tsirkin, 2023/01/30
- [PULL 10/56] x86: don't let decompressed kernel image clobber setup_data, Michael S. Tsirkin, 2023/01/30
- [PULL 13/56] tests: acpi: whitelist DSDT blobs for tests that use pci-bridges, Michael S. Tsirkin, 2023/01/30
- [PULL 12/56] tests: acpi: cleanup arguments to make them more readable, Michael S. Tsirkin, 2023/01/30
- [PULL 14/56] tests: acpi: extend pcihp with nested bridges, Michael S. Tsirkin, 2023/01/30
- [PULL 15/56] tests: acpi: update expected blobs, Michael S. Tsirkin, 2023/01/30
- [PULL 17/56] pci_bridge: remove whitespace, Michael S. Tsirkin, 2023/01/30
- [PULL 16/56] tests: acpi: cleanup use_uefi argument usage, Michael S. Tsirkin, 2023/01/30
- [PULL 20/56] pcihp: piix4: do not call acpi_pcihp_reset() when ACPI PCI hotplug is disabled, Michael S. Tsirkin, 2023/01/30
- [PULL 01/56] shpc: disallow unplug when power indicator is blinking, Michael S. Tsirkin, 2023/01/30