qemu-devel

Re: [PATCH V7 10/29] machine: memfd-alloc option


From: David Hildenbrand
Subject: Re: [PATCH V7 10/29] machine: memfd-alloc option
Date: Fri, 11 Mar 2022 11:25:43 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.6.2

On 03.03.22 18:21, Michael S. Tsirkin wrote:
> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:
>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>> option is set.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>  include/hw/boards.h |  1 +
>>  qemu-options.hx     |  6 ++++++
>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>>  softmmu/vl.c        |  1 +
>>  trace-events        |  1 +
>>  util/qemu-config.c  |  4 ++++
>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 53a99ab..7739d88 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>      ms->mem_merge = value;
>>  }
>>  
>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    return ms->memfd_alloc;
>> +}
>> +
>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    ms->memfd_alloc = value;
>> +}
>> +
>>  static bool machine_get_usb(Object *obj, Error **errp)
>>  {
>>      MachineState *ms = MACHINE(obj);
>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>      object_class_property_set_description(oc, "mem-merge",
>>          "Enable/disable memory merge support");
>>  
>> +    object_class_property_add_bool(oc, "memfd-alloc",
>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>> +    object_class_property_set_description(oc, "memfd-alloc",
>> +        "Enable/disable allocating anonymous memory using memfd_create");
>> +
>>      object_class_property_add_bool(oc, "usb",
>>          machine_get_usb, machine_set_usb);
>>      object_class_property_set_description(oc, "usb",
>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> index 9c1c190..a57d7a0 100644
>> --- a/include/hw/boards.h
>> +++ b/include/hw/boards.h
>> @@ -327,6 +327,7 @@ struct MachineState {
>>      char *dt_compatible;
>>      bool dump_guest_core;
>>      bool mem_merge;
>> +    bool memfd_alloc;
>>      bool usb;
>>      bool usb_disabled;
>>      char *firmware;
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 7d47510..33c8173 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
> 
> Question: are there any disadvantages associated with using
> memfd_create? I guess we are using up an fd, but that seems minor.  Any
> reason not to set to on by default? maybe with a fallback option to
> disable that?
> 
> I am concerned that it's actually a kind of memory backend, this flag
> seems to instead be closer to the deprecated mem-prealloc. E.g.
> it does not work with a mem path, does it?

We had an RH-internal discussion some time ago; here is my writeup (note
the TMPFS/SHMEM discussion):

--- snip ---

In QEMU, we specify the type of guest RAM via
* -object memory-backend-ram,...
* -object memory-backend-file,...
* -object memory-backend-memfd,...

We can specify whether to share the memory (share=on -- MAP_SHARED),
or whether to keep modifications local to QEMU (share=off -- MAP_PRIVATE).

Using "share=off" (or using the default) with files/memfd can have some
serious side-effects.

ALERT: "share=off" is the default in QEMU for memory-backend-ram and
memory-backend-file. "share=on" is the default in QEMU only for
memory-backend-memfd.


I. MAP_SHARED vs. MAP_PRIVATE

MAP_SHARED: when reading, read file content; when writing, modify file
             content.
MAP_PRIVATE: when reading, read file content, except if there was a
              local/private change. When writing, keep change
              local/private and don't modify file content.
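The semantics above can be checked with a few lines of C. This is only a
minimal sketch, assuming Linux with glibc >= 2.27 (for memfd_create());
the function name check_map_semantics() and the strings are made up for
illustration and have nothing to do with QEMU code. Note that seeing
later file changes through an untouched MAP_PRIVATE page is Linux
behavior; POSIX leaves it unspecified.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if MAP_SHARED/MAP_PRIVATE behave as
 * described above, -1 on a setup failure or an unexpected observation.
 */
int check_map_semantics(void)
{
    const size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *shared = MAP_FAILED;
    char *priv = MAP_FAILED;
    int ret = -1;
    int fd = memfd_create("map-demo", 0);   /* anonymous shmem file */

    assert(len > 0);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        goto out;
    }

    shared = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    priv = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (shared == MAP_FAILED || priv == MAP_FAILED) {
        goto out;
    }

    /* MAP_SHARED write modifies the file; an untouched private page reads it. */
    strcpy(shared, "file");
    if (strcmp(priv, "file") != 0) {
        goto out;
    }

    /* MAP_PRIVATE write stays local; the file content is unchanged. */
    strcpy(priv, "local");
    if (strcmp(shared, "file") != 0) {
        goto out;
    }

    /* After the private change, later file updates are no longer visible. */
    strcpy(shared, "newer");
    if (strcmp(priv, "local") != 0) {
        goto out;
    }
    ret = 0;

out:
    if (shared != MAP_FAILED) {
        munmap(shared, len);
    }
    if (priv != MAP_FAILED) {
        munmap(priv, len);
    }
    close(fd);
    return ret;
}
```

Calling check_map_semantics() from a main() is enough to try it out. The
last step is the part that makes MAP_PRIVATE differ from a real
snapshot: only pages that were already modified locally are decoupled
from the file.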


MAP_PRIVATE sounds like a snapshot, however, in some cases it really
behaves differently -- especially with tmpfs/shmem and when QEMU
discards memory (e.g., with virtio-balloon or during postcopy live
migration).

There is some connection between MAP_PRIVATE and NUMA bindings that I
have yet to fully explore. We could have issues with some MAP_SHARED
mappings and NUMA bindings (IOW: policy getting ignored).


II. Impact on different memory backends/types

II.1. Anonymous memory:

Usage: -object memory-backend-ram,...

We really want "share=off" in 99.99% of all cases. Shared anonymous RAM
-- i.e., sharing RAM with your child processes -- does not really apply
to QEMU and there are some cases that are broken in QEMU [1]; there is
only a single use case in the context of RDMA -- whereby we only need
shared anonymous memory to make mremap() work, not for actually sharing
RAM with someone else.

II.2. TMPFS/SHMEM

Usage: -object memory-backend-memfd,...
        -object memory-backend-file,mem-path=/dev/shm/FILE,...

We really want "share=on" in 99.99999% of all cases. There is a serious
issue when using private mappings on an empty shmem file, whereby we can
get a double memory consumption. The issue is that even when reading
via a private mapping, we will allocate memory for the actual file (==
RAM for tmpfs) -- even if it's just allocating blocks filled with zero.

So doing a -object memory-backend-file,mem-path=/dev/shm/FILE (which
defaults to share=off) will in the worst case consume 4G, even though we
have an anonymous file -- *we have to use share=on*.
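A rough way to observe the file-side allocation is to watch st_blocks of
an empty memfd while touching it only through a private mapping. This is
a sketch under assumptions, not a measurement tool: the helper names
file_bytes() and private_shmem_footprint() are invented here, it assumes
Linux with memfd_create(), and the exact numbers reported depend on the
kernel and its shmem/THP configuration. The private copies created by
the write loop are anonymous memory and do not show up in st_blocks;
they come on top of whatever the file already holds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Bytes currently allocated for the file behind fd, or -1 on error. */
static long long file_bytes(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0) {
        return -1;
    }
    return (long long)st.st_blocks * 512;   /* st_blocks uses 512-byte units */
}

/*
 * Hypothetical helper: map an empty memfd MAP_PRIVATE, read every page,
 * then write every page, and report the file allocation after each step.
 * Returns 0 on success, -1 on failure.
 */
int private_shmem_footprint(size_t npages, long long *after_read,
                            long long *after_write)
{
    const size_t psize = (size_t)sysconf(_SC_PAGESIZE);
    const size_t len = npages * psize;
    int ret = -1;
    int fd = memfd_create("footprint-demo", 0);

    assert(after_read && after_write);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) == 0) {
        volatile unsigned char *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                           MAP_PRIVATE, fd, 0);

        if (ram != MAP_FAILED) {
            /* Read faults: these may already allocate zeroed file pages. */
            for (size_t i = 0; i < npages; i++) {
                (void)ram[i * psize];
            }
            *after_read = file_bytes(fd);

            /* Write faults: private copies are made, file pages remain. */
            for (size_t i = 0; i < npages; i++) {
                ram[i * psize] = 1;
            }
            *after_write = file_bytes(fd);

            munmap((void *)ram, len);
            ret = (*after_read < 0 || *after_write < 0) ? -1 : 0;
        }
    }
    close(fd);
    return ret;
}
```

If the behavior described above applies on the running kernel,
*after_read should already be about npages * page size although nothing
was ever written to the file.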

II.3. Hugetlb

Usage: -object memory-backend-memfd,hugetlb=on,hugetlbsize=2M,...
        -object memory-backend-file,mem-path=/dev/shm/FILE,...

We usually want "share=on". However, there seems to be nothing wrong
about using "memory-backend-memfd" -- IOW an anonymous file; it works as
expected in my tests (fallocate() behaves in weird ways, but that's a
different story).

II.4. "Ordinary" Files

Usage: -object memory-backend-file,mem-path=/some/file,...

We want "share=on" in 99.9% of all cases, to have modifications go back
to the file -- for example, for the "big file" use case where we want to
use the actual file storage as memory backend (for example, when
swapping is not desired), such that we can use the page cache where
possible, but write back the file content to disk when under memory
pressure.

II.5. DAX/PMEM

Usage: -object memory-backend-file,mem-path=/dev/dax,...

We want "share=on" in 99.99999% of all cases when using dax/pmem in an
emulated NVDIMM for our guest. We want the changes to go back to
dax/pmem a.k.a. the actual NVDIMM (not some mixture of pmem and system RAM).


III. MAP_PRIVATE vs. virtio-balloon and postcopy live migration

Dave told me about a use case where we

a) Start a VM with a MAP_SHARED file as guest RAM until it is booted up
b) Save the VM state, *excluding guest RAM*
c) Start multiple VMs using the VM state and the MAP_PRIVATE file as
guest RAM

This is essentially a fast "guest snapshot". But beware if you end up
discarding memory in QEMU via ram_block_discard_range(), e.g., via
virtio-balloon or via postcopy live migration.

In QEMU, we always discard file content and modified pages in private
mappings.

Problem: If one VM discards memory, it will modify the snapshot. The
snapshot will be broken. New VMs and running VMs will be affected!

Note: We cannot easily teach QEMU to not modify file content when
discarding memory of private mappings. This would break postcopy live
migration completely in some cases.
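The snapshot problem can be reproduced outside of QEMU with plain
syscalls. The sketch below is not QEMU's ram_block_discard_range(); it
only assumes that a discard boils down to dropping the private pages
(madvise(MADV_DONTNEED)) plus dropping the file content
(fallocate(FALLOC_FL_PUNCH_HOLE)), as described above. The function name
discard_breaks_snapshot(), the "vm1"/"vm2" mappings and the strings are
made up; a memfd stands in for the snapshot file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a discard done by one "VM" destroyed
 * the snapshot content seen by another one, 0 if the snapshot survived,
 * -1 on a setup failure.
 */
int discard_breaks_snapshot(void)
{
    const size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *boot = MAP_FAILED, *vm1 = MAP_FAILED, *vm2 = MAP_FAILED;
    int ret = -1;
    int fd = memfd_create("snapshot-demo", 0);

    assert(len > 0);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        goto out;
    }

    /* a) "Boot" with a MAP_SHARED mapping so the file holds guest RAM. */
    boot = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (boot == MAP_FAILED) {
        goto out;
    }
    strcpy(boot, "booted guest state");

    /* c) Two "VMs" start from the same file using MAP_PRIVATE. */
    vm1 = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    vm2 = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (vm1 == MAP_FAILED || vm2 == MAP_FAILED) {
        goto out;
    }
    if (strcmp(vm2, "booted guest state") != 0) {
        goto out;
    }
    strcpy(vm1, "vm1 private change");

    /* vm1 discards the page: private copy first, then the file content. */
    if (madvise(vm1, len, MADV_DONTNEED) < 0 ||
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, (off_t)len) < 0) {
        goto out;
    }

    /* vm2 never wrote this page, so it now reads the punched hole. */
    ret = (vm2[0] == '\0') ? 1 : 0;

out:
    if (boot != MAP_FAILED) {
        munmap(boot, len);
    }
    if (vm1 != MAP_FAILED) {
        munmap(vm1, len);
    }
    if (vm2 != MAP_FAILED) {
        munmap(vm2, len);
    }
    close(fd);
    return ret;
}
```

Dropping the fallocate() call keeps the snapshot intact in this sketch,
but then vm1 would simply read the old file content again after the
discard instead of zeroes, which is the postcopy issue mentioned in the
note above.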

--- snip ---

-- 
Thanks,

David / dhildenb



