From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH] pflash: Only read non-zero parts of backend image
Date: Mon, 13 May 2019 15:15:21 +0200
User-agent: Mutt/1.11.3 (2019-02-01)

On 13.05.2019 at 13:59, Markus Armbruster wrote:
> Xiang Zheng <address@hidden> writes:
> 
> > On 2019/5/10 23:16, Markus Armbruster wrote:
> >> Xiang Zheng <address@hidden> writes:
> >> 
> >>> On 2019/5/9 19:59, Markus Armbruster wrote:
> >>>> Xiang Zheng <address@hidden> writes:
> >>>>
> >>>>> On 2019/5/8 21:20, Markus Armbruster wrote:
> >>>>>> Laszlo Ersek <address@hidden> writes:
> >>>>>>
> >>>>>>> Hi Markus,
> >>>>>>>
> >>>>>>> On 05/07/19 20:01, Markus Armbruster wrote:
> >>>>>>>> The subject is slightly misleading.  Holes read as zero.  So do
> >>>>>>>> non-holes full of zeroes.  The patch avoids reading the former, but
> >>>>>>>> still reads the latter.
> >>>>>>>>
> >>>>>>>> Xiang Zheng <address@hidden> writes:
> >>>>>>>>
> >>>>>>>>> Currently we fill the memory space with two 64MB NOR images when
> >>>>>>>>> using persistent UEFI variables on the virt board. Actually we only
> >>>>>>>>> use a very small (non-zero) part of the memory, while the rest, a
> >>>>>>>>> significantly larger (zero) part, is wasted.
> >>>>>>>>
> >>>>>>>> Neglects to mention that the "virt board" is ARM.
> >>>>>>>>
> >>>>>>>>> So this patch checks the block status and only writes the non-zero part
> >>>>>>>>> into memory. This requires pflash devices to use sparse files for
> >>>>>>>>> backends.
> >>>>>>>>
> >>>>>>>> I started to draft an improved commit message, but then I realized this
> >>>>>>>> patch can't work.
> >>>>>>>>
> >>>>>>>> The pflash_cfi01 device allocates its device memory like this:
> >>>>>>>>
> >>>>>>>>     memory_region_init_rom_device(
> >>>>>>>>         &pfl->mem, OBJECT(dev),
> >>>>>>>>         &pflash_cfi01_ops,
> >>>>>>>>         pfl,
> >>>>>>>>         pfl->name, total_len, &local_err);
> >>>>>>>>
> >>>>>>>> pflash_cfi02 is similar.
> >>>>>>>>
> >>>>>>>> memory_region_init_rom_device() calls
> >>>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
> >>>>>>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
> >>>>>>>> memory gets written to even with this patch.
> >>>>>>>
> >>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
> >>>>>>> allocate the new RAMBlock object called "new_block". The actual
> >>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called
> >>>>>>> by qemu_ram_alloc_internal().
> >>>>>>
> >>>>>> You're right.  I should've read more attentively.
> >>>>>>
> >>>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
> >>>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
> >>>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
> >>>>>>>
> >>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
> >>>>>>> phys_mem_alloc().
> >>>>>>>
> >>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
> >>>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
> >>>>>>> "target/s390x/kvm.c" (setting the function pointer to
> >>>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
> >>>>>>> up calling the default qemu_anon_ram_alloc() function, through the
> >>>>>>> funcptr. (I think anyway.)
> >>>>>>>
> >>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
> >>>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
> >>>>>>> passes (-1) for "fd".)
> >>>>>>>
> >>>>>>> I may have missed something, of course -- I obviously didn't test it,
> >>>>>>> just speculated from the source.
> >>>>>>
> >>>>>> Thanks for your sleuthing!
> >>>>>>
> >>>>>>>> I'm afraid you neglected to test.
> >>>>>>
> >>>>>> Accusation actually unsupported.  I apologize, and replace it by a
> >>>>>> question: have you observed the improvement you're trying to achieve,
> >>>>>> and if yes, how?
> >>>>>>
> >>>>>
> >>>>> Yes, we need to create sparse files as the backing images for pflash device.
> >>>>> To create sparse files like:
> >>>>>
> >>>>>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
> >>>>>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc
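
[Editorial note: whether dd really produced a sparse file can be checked by
comparing the apparent size with the allocated size. A small shell sketch
(the filename here is illustrative, not one of the images above):

```shell
# Create a 64 MiB sparse file the same way as above: seek past the end
# without writing any data.
dd of=demo-pflash.raw if=/dev/zero bs=1M seek=64 count=0 2>/dev/null

# Apparent size is 64 MiB, but almost no blocks are allocated.
stat -c 'apparent: %s bytes, allocated: %b blocks of %B bytes' demo-pflash.raw

rm demo-pflash.raw
```

If the "allocated" figure is near zero, the file is sparse and reads of the
hole return zeroes without occupying disk space.]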
> >>>>
> >>>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
> >>>> 64MiB.
> >>>>
> >>>>>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0
> >>>>
> >>>> This creates the varstore as a 64MiB hole.  As far as I know (very
> >>>> little), you should use the varstore template that comes with the
> >>>> firmware binary.
> >>>>
> >>>> I use
> >>>>
> >>>>     cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
> >>>>     cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .
> >>>>
> >>>> These guys are already zero-padded, and I use cp to sparsify.
> >>>>
> >>>>> Start a VM with below commandline:
> >>>>>
> >>>>>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
> >>>>>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
> >>>>>
> >>>>> Then observe the memory usage of the qemu process (THP is on).
> >>>>>
> >>>>> 1) Without this patch:
> >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> >>>>> AnonHugePages:    706560 kB
> >>>>> AnonHugePages:      2048 kB
> >>>>> AnonHugePages:     65536 kB    // pflash memory device
> >>>>> AnonHugePages:     65536 kB    // pflash memory device
> >>>>> AnonHugePages:      2048 kB
> >>>>>
> >>>>> # ps aux | grep qemu-system-aarch64
> >>>>> RSS: 879684
> >>>>>
> >>>>> 2) After applying this patch:
> >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> >>>>> AnonHugePages:    700416 kB
> >>>>> AnonHugePages:      2048 kB
> >>>>> AnonHugePages:      2048 kB    // pflash memory device
> >>>>> AnonHugePages:      2048 kB    // pflash memory device
> >>>>> AnonHugePages:      2048 kB
> >>>>>
> >>>>> # ps aux | grep qemu-system-aarch64
> >>>>> RSS: 744380
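
[Editorial note: the per-mapping grep above can be collapsed into a single
total, which makes the before/after comparison easier. A sketch, assuming a
single qemu-system-aarch64 process is running:

```shell
# Sum all AnonHugePages entries across the process's mappings.
awk '/^AnonHugePages:/ { sum += $2 } END { print sum " kB" }' \
    "/proc/$(pidof qemu-system-aarch64)/smaps"
```

Running this before and after the patch gives one number per configuration
instead of a list of per-mapping figures.]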
> >>>>
> >>>> Okay, this demonstrates the patch succeeds at mapping parts of the
> >>>> pflash memory as holes.
> >>>>
> >>>> Do the guests in these QEMU processes run?
> >>>
> >>> Yes.
> >> 
> >> Good to know, thanks.
> >> 
> >>>>> Obviously, there is at least 100MiB of memory saved for each guest.
> >>>>
> >>>> For a definition of "memory".
> >>>>
> >>>> Next question: what impact on system performance do you observe?
> >>>>
> >>>> Let me explain.
> >>>>
> >>>> Virtual memory holes get filled in by demand paging on access.  In other
> >>>> words, they remain holes only as long as nothing accesses the memory.
> >>>>
> >>>> Without your patch, we allocate pages at image read time and fill them
> >>>> with zeroes. If we don't access them again, the kernel will eventually
> >>>> page them out (assuming you're running with swap).  So the steady state
> >>>> is "we waste some swap space", not "we waste some physical RAM".
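
[Editorial note: the demand-paging behaviour described above can be observed
directly from userspace. A Linux-only Python sketch; the 64 MiB size mirrors
one pflash image, everything else is illustrative:

```python
import mmap
import re

def rss_kib():
    # Read this process's resident set size from /proc (Linux-only).
    with open("/proc/self/status") as f:
        return int(re.search(r"VmRSS:\s+(\d+) kB", f.read()).group(1))

SIZE = 64 * 1024 * 1024  # 64 MiB, like one pflash image

buf = mmap.mmap(-1, SIZE)  # anonymous mapping: starts out as a hole
before = rss_kib()

# Touch one byte per page; demand paging now allocates real frames.
for off in range(0, SIZE, mmap.PAGESIZE):
    buf[off] = 1

after = rss_kib()
print(f"RSS grew by roughly {(after - before) // 1024} MiB")
buf.close()
```

The mapping costs essentially nothing until the pages are written, at which
point RSS grows by the full 64 MiB, which is exactly the trade-off being
discussed: the patch avoids the write, not the mapping.]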
> >>>>
> >>>
> >>> Not everybody wants to run with swap because it may hurt performance.
> >> 
> >> Someone running without swap because he heard someone say someone said
> >> swap may be slow is probably throwing away performance.
> >> 
> >> But I assume you mean people running without swap because they measured
> >> their workload and found it more performant without swap.  Legitimate.
> >
> > Yes, and I have suffered from high I/O waits with swap. :)
> >
> >> 
> >>>> Your patch lets us map pflash memory pages containing only zeros as
> >>>> holes.
> >>>>
> >>>> For pages that never get accessed, your patch avoids page allocation,
> >>>> filling with zeroes, writing to swap (all one-time costs), and saves
> >>>> some swap space (not commonly an issue).
> >>>>
> >>>> For pflash memory that gets accessed, your patch merely delays page
> >>>> allocation from image read time to first access.
> >>>>
> >>>> I wonder how these savings and delays affect actual system performance.
> >>>> Without an observable change in system performance, all we'd accomplish
> >>>> is changing a bunch of numbers in /proc/$pid/.
> >>>>
> >>>> What improvement(s) can you observe?
> >>>
> >>> We only use the pflash device for UEFI, and we hardly care about its
> >>> performance. I think the performance bottleneck is the MMIO emulation,
> >>> even though this patch delays page allocation to first access.
> >> 
> >> I wasn't inquiring about the performance of the pflash device.  I was
> >> inquiring about *system* performance.  But let me rephrase my question.
> >> 
> >> Doing work to save resources is only worthwhile if something valuable
> >> gets better in a measurable way.  I'm asking you
> >> 
> >> (1) to explain what exactly you value, and 
> >> 
> >> (2) to provide measurements that show improvement.
> >> 
> >
> > What we exactly value is the cost of memory resources and it is the only
> > thing that this patch aims to resolve.
> 
> Then measure this cost!
> 
> > I am confused about why you think it will impact the system performance. Did I
> > neglect something?
> 
> If the patch does not impact how the system as a whole performs, then
> it's useless.
> 
> Since you find it useful, it must have some valuable[*] observable
> effect for you.  Tell us about it!
> 
> I keep asking not to torment you, but to guide you towards building a
> compelling justification for your patch.  However, I can only show you
> the path; the walking you'll have to do yourself.

Is this discussion really a good use of our time?

The patch is simple, and a few obvious improvements it brings were
mentioned (even by yourself), such as avoiding OOM without swap; and
with swap enabled, saving swap space for more useful content and
saving unnecessary I/O related to accessing swap needlessly.

You may consider these improvements negligible, but even small
improvements can add up. If you really want to measure them, it should be
clear how to do it. I don't see value in actually setting up such
environments just to get some numbers that show what we already know.

So, what's the downside of the patch? The worst case is that the memory
usage numbers merely look better because most people don't have a use case
where the improvement matters. There might be some maintenance cost
associated with the code, but it's small and I suspect this discussion
has already cost us more time than maintaining the code will ever cost
us.

So why not just take it?

Kevin


