qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] softmmu/physmem: Use qemu_madvise


From: Andrew Deason
Subject: Re: [PATCH] softmmu/physmem: Use qemu_madvise
Date: Wed, 16 Mar 2022 09:29:00 -0500

On Wed, 16 Mar 2022 10:41:41 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 16.03.22 10:37, Dr. David Alan Gilbert wrote:
> > * Peter Maydell (peter.maydell@linaro.org) wrote:
> >> On Wed, 16 Mar 2022 at 07:53, David Hildenbrand <david@redhat.com> wrote:
> >>>
> >>> On 16.03.22 05:04, Andrew Deason wrote:
> >>>> We have a thin wrapper around madvise, called qemu_madvise, which
> >>>> provides consistent behavior for the !CONFIG_MADVISE case, and works
> >>>> around some platform-specific quirks (some platforms only provide
> >>>> posix_madvise, and some don't offer all 'advise' types). This specific
> >>>> caller of madvise has never used it, tracing back to its original
> >>>> introduction in commit e0b266f01dd2 ("migration_completion: Take
> >>>> current state").
> >>>>
> >>>> Call qemu_madvise here, to follow the same logic as all of our other
> >>>> madvise callers. This slightly changes the behavior for
> >>>> !CONFIG_MADVISE (EINVAL instead of ENOSYS, and a slightly different
> >>>> error message), but this is now more consistent with other callers
> >>>> that use qemu_madvise.
> >>>>
> >>>> Signed-off-by: Andrew Deason <adeason@sinenomine.net>
> >>>> ---
> >>>> Looking at the history of commits that touch this madvise() call, it
> >>>> doesn't _look_ like there's any reason to be directly calling madvise vs
> >>>> qemu_advise (I don't see anything mentioned), but I'm not sure.
> >>>>
> >>>>  softmmu/physmem.c | 12 ++----------
> >>>>  1 file changed, 2 insertions(+), 10 deletions(-)
> >>>>
> >>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> >>>> index 43ae70fbe2..900c692b5e 100644
> >>>> --- a/softmmu/physmem.c
> >>>> +++ b/softmmu/physmem.c
> >>>> @@ -3584,40 +3584,32 @@ int ram_block_discard_range(RAMBlock *rb, 
> >>>> uint64_t start, size_t length)
> >>>>                           rb->idstr, start, length, ret);
> >>>>              goto err;
> >>>>  #endif
> >>>>          }
> >>>>          if (need_madvise) {
> >>>>              /* For normal RAM this causes it to be unmapped,
> >>>>               * for shared memory it causes the local mapping to 
> >>>> disappear
> >>>>               * and to fall back on the file contents (which we just
> >>>>               * fallocate'd away).
> >>>>               */
> >>>> -#if defined(CONFIG_MADVISE)
> >>>>              if (qemu_ram_is_shared(rb) && rb->fd < 0) {
> >>>> -                ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE);
> >>>> +                ret = qemu_madvise(host_startaddr, length, 
> >>>> QEMU_MADV_REMOVE);
> >>>>              } else {
> >>>> -                ret = madvise(host_startaddr, length, 
> >>>> QEMU_MADV_DONTNEED);
> >>>> +                ret = qemu_madvise(host_startaddr, length, 
> >>>> QEMU_MADV_DONTNEED);
> >>>
> >>> posix_madvise(QEMU_MADV_DONTNEED) has completely different semantics
> >>> then madvise() -- it's not a discard that we need here.
> >>>
> >>> So ram_block_discard_range() would now succeed in environments (BSD?)
> >>> where it's supposed to fail.
> >>>
> >>> So AFAIKs this isn't sane.
> >>
> >> But CONFIG_MADVISE just means "host has madvise()"; it doesn't imply
> >> "this is a Linux madvise() with MADV_DONTNEED". Solaris madvise()
> >> doesn't seem to have  MADV_DONTNEED at all; a quick look at the
> >> FreeBSD manpage suggests its madvise MADV_DONTNEED is identical
> >> to its posix_madvise MADV_DONTNEED.
> >>
> >> If we need "specifically Linux MADV_DONTNEED semantics" maybe we
> >> should define a QEMU_MADV_LINUX_DONTNEED which either (a) does the
> >> right thing or (b) fails, and use qemu_madvise() regardless.
> >>
> >> Certainly the current code is pretty fragile to being changed by
> >> people who don't understand the undocumented subtlety behind
> >> the use of a direct madvise() call here.
> > 
> > Yeh and I'm not sure I can remembe rall the subtleties; there's a big
> > hairy set of ifdef's in include/qemu/madvise.h that makes
> > sure we always have the definition of QEMU_MADV_REMOVE/DONTNEED
> > even on platforms that might not define it themselves.
> > 
> > But I think this code is used for things with different degrees
> > of care about the semantics; e.g. 'balloon' just cares that
> > it frees memory up and doesn't care about the detailed semantics
> > that much; so it's probably fine with that.
> > Postcopy is much more touchy, but then it's only going to be
> > calling this on Linux anyway (because of the userfault dependency).
> 
> MADV_DONTNEED/MADV_REMOVE only provides discard semantics on Linux IIRC
> -- and that's what we want to achieve: ram_block_discard_range()
> 
> So I agree with Peter that we might want to make this more explicit.

Maybe this could use an argument like QEMU_MADV_DONTNEED_DISCARD, which
gets #define'd to QEMU_MADV_INVALID for posix_madvise, or based on a
configure-time test, or just all non-Linux to be safe? Regardless, it
means "MADV_DONTNEED with discard semantics".

Solaris does have MADV_DONTNEED (possibly new in Solaris 11; I haven't
checked the history), but no MADV_REMOVE. As of 11.4.42 CBE, madvise(3C) says:

       MADV_DONTNEED

           Tell  the  kernel  that  the  specified  address range is no longer
           needed, so the system starts to free the resources associated  with
           the address range.

           MADV_DONTNEED  and  MADV_FREE  perform  related but distinct opera-
           tions. MADV_DONTNEED tries to move  any  data  from  the  specified
           address  range  out  of memory, but it ensures that the contents of
           that range  will  be  recovered  when  they  are  next  referenced.
           MADV_FREE  does not attempt to preserve the contents of the address
           range. As a result, subsequent references to an address range  that
           received  madvise (MADV_DONTNEED) are likely to be slower than ref-
           erences to a range that received madvise (MADV_FREE).

Is there a small test program I could run to see if the semantics are
what we want? Or should we just assume only Linux works here; I assume
that's safer.

-- 
Andrew Deason
adeason@sinenomine.net



reply via email to

[Prev in Thread] Current Thread [Next in Thread]