From: Markus Armbruster
Subject: Re: [Qemu-devel] [PATCH] pflash: Only read non-zero parts of backend image
Date: Thu, 09 May 2019 13:59:45 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Xiang Zheng <address@hidden> writes:

> On 2019/5/8 21:20, Markus Armbruster wrote:
>> Laszlo Ersek <address@hidden> writes:
>> 
>>> Hi Markus,
>>>
>>> On 05/07/19 20:01, Markus Armbruster wrote:
>>>> The subject is slightly misleading.  Holes read as zero.  So do
>>>> non-holes full of zeroes.  The patch avoids reading the former, but
>>>> still reads the latter.
>>>>
>>>> Xiang Zheng <address@hidden> writes:
>>>>
>>>>> Currently we fill the memory space with two 64MB NOR images when
>>>>> using persistent UEFI variables on virt board. Actually we only use
> >>>>> a very small (non-zero) part of the memory, while the rest, a
> >>>>> significantly larger (zero) part, is wasted.
>>>>
>>>> Neglects to mention that the "virt board" is ARM.
>>>>
>>>>> So this patch checks the block status and only writes the non-zero part
>>>>> into memory. This requires pflash devices to use sparse files for
>>>>> backends.
>>>>
>>>> I started to draft an improved commit message, but then I realized this
>>>> patch can't work.
>>>>
>>>> The pflash_cfi01 device allocates its device memory like this:
>>>>
>>>>     memory_region_init_rom_device(
>>>>         &pfl->mem, OBJECT(dev),
>>>>         &pflash_cfi01_ops,
>>>>         pfl,
>>>>         pfl->name, total_len, &local_err);
>>>>
>>>> pflash_cfi02 is similar.
>>>>
>>>> memory_region_init_rom_device() calls
>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls
>>>> qemu_ram_alloc_internal() calls g_malloc0().  Thus, all the device
>>>> memory gets written to even with this patch.
>>>
>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to
>>> allocate the new RAMBlock object called "new_block". The actual
>>> guest RAM allocation occurs inside ram_block_add(), which is also called
>>> by qemu_ram_alloc_internal().
>> 
>> You're right.  I should've read more attentively.
>> 
>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to
>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in
>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well.
>>>
>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call
>>> phys_mem_alloc().
>>>
>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with
>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from
>>> "target/s390x/kvm.c" (setting the function pointer to
>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end
>>> up calling the default qemu_anon_ram_alloc() function, through the
>>> funcptr. (I think anyway.)
>>>
>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in
>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc()
>>> passes (-1) for "fd".)
>>>
>>> I may have missed something, of course -- I obviously didn't test it,
>>> just speculated from the source.
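
Side note: per Laszlo's trace above, the allocation bottoms out in an
anonymous mmap(2) with fd = -1.  A minimal stand-alone sketch of just
that final step, purely for illustration (not the actual QEMU code,
which also handles alignment, preallocation and errors):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t size = 64 * 1024 * 1024;     /* 64 MiB, like one pflash image */

        /* Anonymous, private mapping with fd = -1: reads as zeroes,
         * consumes no physical pages until something writes to it. */
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("mapped 64 MiB at %p, all of it still holes\n", p);
        return 0;
    }
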
>> 
>> Thanks for your sleuthing!
>> 
>>>> I'm afraid you neglected to test.
>> 
>> Accusation actually unsupported.  I apologize, and replace it by a
>> question: have you observed the improvement you're trying to achieve,
>> and if yes, how?
>> 
>
> Yes, we need to create sparse files as the backing images for the pflash devices.
> Create the sparse files like this:
>
>    dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0
>    dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc

This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to
64MiB.

>    dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0

This creates the varstore as a 64MiB hole.  As far as I know (very
little), you should use the varstore template that comes with the
firmware binary.

I use

    cp --sparse=always bld/pc-bios/edk2-arm-vars.fd .
    cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd .

These guys are already zero-padded, and I use cp to sparsify.
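
If you want to double-check that a copy really is sparse, compare its
apparent size with the blocks actually allocated on disk.  A quick
sketch (the file name on the command line is up to you):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;

        if (argc != 2) {
            fprintf(stderr, "usage: %s FILE\n", argv[0]);
            return 1;
        }
        if (stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        /* st_blocks counts 512-byte units; a sparse file has far fewer
         * allocated bytes than its apparent size. */
        printf("%s: apparent %lld bytes, allocated %lld bytes\n",
               argv[1], (long long)st.st_size,
               (long long)st.st_blocks * 512);
        return 0;
    }

From the shell, "du FILE" vs. "du --apparent-size FILE" tells you the
same thing.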

> Start a VM with the command line below:
>
>     -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on \
>     -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \
>
> Then observe the memory usage of the qemu process (THP is on).
>
> 1) Without this patch:
> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> AnonHugePages:    706560 kB
> AnonHugePages:      2048 kB
> AnonHugePages:     65536 kB    // pflash memory device
> AnonHugePages:     65536 kB    // pflash memory device
> AnonHugePages:      2048 kB
>
> # ps aux | grep qemu-system-aarch64
> RSS: 879684
>
> 2) After applying this patch:
> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB'
> AnonHugePages:    700416 kB
> AnonHugePages:      2048 kB
> AnonHugePages:      2048 kB    // pflash memory device
> AnonHugePages:      2048 kB    // pflash memory device
> AnonHugePages:      2048 kB
>
> # ps aux | grep qemu-system-aarch64
> RSS: 744380

Okay, this demonstrates the patch succeeds at mapping parts of the
pflash memory as holes.

Do the guests in these QEMU processes run?

> Obviously, at least 100MiB of memory is saved for each guest.

For a definition of "memory".

Next question: what impact on system performance do you observe?

Let me explain.

Virtual memory holes get filled in by demand paging on access.  In other
words, they remain holes only as long as nothing accesses the memory.

Without your patch, we allocate pages at image read time and fill them
with zeroes. If we don't access them again, the kernel will eventually
page them out (assuming you're running with swap).  So the steady state
is "we waste some swap space", not "we waste some physical RAM".

Your patch lets us map pflash memory pages containing only zeros as
holes.

For pages that never get accessed, your patch avoids page allocation,
filling with zeroes, writing to swap (all one-time costs), and saves
some swap space (not commonly an issue).

For pflash memory that gets accessed, your patch merely delays page
allocation from image read time to first access.
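
To make that concrete, here's a rough stand-alone illustration (again
not QEMU code): map an anonymous 64MiB region, print the process RSS
from /proc/self/statm, touch every page, print RSS again.  The second
number should grow by roughly the size of the mapping, because the
first write to each page is what actually allocates it.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Print resident set size, read from /proc/self/statm
     * (second field, counted in pages). */
    static void print_rss(const char *when)
    {
        long pages_total, pages_resident;
        FILE *f = fopen("/proc/self/statm", "r");

        if (!f) {
            perror("fopen");
            return;
        }
        if (fscanf(f, "%ld %ld", &pages_total, &pages_resident) == 2) {
            printf("%s: RSS %ld KiB\n", when,
                   pages_resident * sysconf(_SC_PAGESIZE) / 1024);
        }
        fclose(f);
    }

    int main(void)
    {
        size_t size = 64 * 1024 * 1024;     /* 64 MiB, like one pflash image */
        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        print_rss("after mmap, all holes");
        memset(p, 0, size);                 /* first access faults pages in */
        print_rss("after touching every page");
        return 0;
    }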

I wonder how these savings and delays affect actual system performance.
Without an observable change in system performance, all we'd accomplish
is changing a bunch of numbers in /proc/$pid/.

What improvement(s) can you observe?

I guess the best case for your patch is many guests with relatively
small RAM sizes.


