[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_

From: David Hildenbrand
Subject: Re: [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
Date: Thu, 22 Jul 2021 16:13:26 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

On 22.07.21 15:47, Daniel P. Berrangé wrote:
On Thu, Jul 22, 2021 at 03:39:50PM +0200, David Hildenbrand wrote:
On 22.07.21 15:31, Daniel P. Berrangé wrote:
On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
does not require a SIGBUS handler, doesn't actually touch page content,
and avoids context switches; it is, therefore, faster and easier to handle
than our current approach.

While MADV_POPULATE_WRITE is, in general, faster than manual
prefaulting, and especially faster with 4k pages, there is still value in
prefaulting using multiple threads to speed up preallocation.

More details on MADV_POPULATE_WRITE can be found in the Linux commit
4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
page tables") and in the man page proposal [1].

[1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com

This resolves the TODO in do_touch_pages().

In the future, we might want to look into using fallocate(), eventually
combined with MADV_POPULATE_READ, when dealing with shared file

Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
   include/qemu/osdep.h |  7 ++++
   util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
   2 files changed, 74 insertions(+), 21 deletions(-)

@@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
       return NULL;
+static void *do_madv_populate_write_pages(void *arg)
+    MemsetThread *memset_args = (MemsetThread *)arg;
+    const size_t size = memset_args->numpages * memset_args->hpagesize;
+    char * const addr = memset_args->addr;
+    int ret;
+    if (!size) {
+        return NULL;
+    }
+    /* See do_touch_pages(). */
+    qemu_mutex_lock(&page_mutex);
+    while (!threads_created_flag) {
+        qemu_cond_wait(&page_cond, &page_mutex);
+    }
+    qemu_mutex_unlock(&page_mutex);
+    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
+    if (ret) {
+        memset_thread_failed = true;

I'm wondering if this use of memset_thread_failed is sufficient.

This is pre-existing from the current impl, and ends up being
used to set the bool result of 'touch_all_pages'. The caller
of that then does

      if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
          error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
              "pages available to allocate guest RAM");

this was reasonable with the old impl, because the only reason
we ever see 'memset_thread_failed==true' is if we got SIGBUS
due to ENOMEM.

My concern is that madvise() has a bunch of possible errno
codes returned on failure, and we're not distinguishing
them. In the past this kind of thing has burnt us making
failures hard to debug.

Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'

Then, we can make 'touch_all_pages' have an 'Error **errp'
parameter, and it can directly call

   error_setg_errno(errp, memset_thead_errno, ....some message...)

when memset_thread_errno is non-zero, and thus we can remove
the generic message from the caller of touch_all_pages.

If you agree, it'd be best to refactor the existing code to
use this pattern in an initial patch.

We could also simply trace the return value, which should be comparatively
easy to add. We should be getting either -ENOMEM or -EHWPOISON. And the
latter is highly unlikely to happen when actually preallocating.

We made sure that we don't end up with -EINVAL as we're sensing of
MADV_POPULATE_WRITE works on the mapping.

Those are in the "normal" usage scenarios. I'm wondering about the
abnormal scenarios where QEMU code is mistakenly screwed up or
libvirt / mgmt app makes some config mistake. eg we can get
things like EPERM if selinux or seccomp block the madvise
syscall by mistake (common if EQMU is inside docker for example),
or can we get EINVAL if the 'addr' is not page aligned, and so on.

So when it comes to debugging, I'd actually prefer tracing -errno, as the
real error will be of little help to end users.

I don't care about the end users interpreting it, rather us as maintainers
who get a bug report containing insufficient info to diagnose the root

Well, okay. I'll have a look how this turns out.


David / dhildenb

reply via email to

[Prev in Thread] Current Thread [Next in Thread]