RE: RFC: Split EPT huge pages in advance of dirty logging

From: Zhoujian (jay)
Subject: RE: RFC: Split EPT huge pages in advance of dirty logging
Date: Wed, 19 Feb 2020 13:19:08 +0000

Hi Peter,

> -----Original Message-----
> From: Peter Xu [mailto:address@hidden]
> Sent: Wednesday, February 19, 2020 1:43 AM
> To: Zhoujian (jay) <address@hidden>
> Cc: address@hidden; address@hidden; address@hidden;
> address@hidden; address@hidden; Liujinsong (Paul)
> <address@hidden>; linfeng (M) <address@hidden>; wangxin (U)
> <address@hidden>; Huangweidong (C)
> <address@hidden>
> Subject: Re: RFC: Split EPT huge pages in advance of dirty logging
> On Tue, Feb 18, 2020 at 01:13:47PM +0000, Zhoujian (jay) wrote:
> > Hi all,
> >
> > We found that the guest will be soft-lockup occasionally when live
> > migrating a 60 vCPU, 512GiB huge page and memory sensitive VM. The
> > reason is clear, almost all of the vCPUs are waiting for the KVM MMU
> > spin-lock to create 4K SPTEs when the huge pages are write protected. This
> > phenomenon is also described in this patch set:
> > https://patchwork.kernel.org/cover/11163459/
> > which aims to handle page faults in parallel more efficiently.
> >
> > Our idea is to use the migration thread to touch all of the guest
> > memory in the granularity of 4K before enabling dirty logging. To be
> > more specific, we split all the PDPE_LEVEL SPTEs into DIRECTORY_LEVEL
> > SPTEs as the first step, and then split all the DIRECTORY_LEVEL SPTEs into
> > PAGE_TABLE_LEVEL SPTEs as the following step.
> IIUC, QEMU will prefer to use huge pages for all the anonymous ramblocks
> (please refer to ram_block_add):
>         qemu_madvise(new_block->host, new_block->max_length,

Yes, you're right.

> Another alternative I can think of is to add an extra parameter to QEMU to
> explicitly disable huge pages (so that can even be MADV_NOHUGEPAGE
> instead of MADV_HUGEPAGE).  However that should also drag down the
> performance for the whole lifecycle of the VM.  

From the performance point of view, it is better to keep the huge pages
when the VM is not in the live migration state.
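
To make the madvise alternative concrete, here is a minimal sketch of
toggling the transparent huge page hint on one host mapping. madvise(2),
MADV_HUGEPAGE and MADV_NOHUGEPAGE are the real Linux interfaces;
thp_advice and set_block_thp are hypothetical helper names used only for
illustration and are not QEMU functions.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: choose the madvise(2) advice value for a RAM block. */
static int thp_advice(int want_huge_pages)
{
    return want_huge_pages ? MADV_HUGEPAGE : MADV_NOHUGEPAGE;
}

/*
 * Hypothetical helper: apply the hint to one host mapping.
 * Returns 0 on success, or -1 with errno set (for example EINVAL when the
 * host kernel has no transparent huge page support).
 */
static int set_block_thp(void *host, size_t len, int want_huge_pages)
{
    return madvise(host, len, thp_advice(want_huge_pages));
}
```

As far as I understand, MADV_NOHUGEPAGE only affects how the range is
populated or collapsed from then on, so the hint alone would probably not
split huge pages that are already mapped.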

> A 3rd option is to make a QMP
> command to dynamically turn huge pages on/off for ramblocks globally.

We're looking for a dynamic method too.
We plan to add two new flags for each memory slot, say
KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES. These flags can be set

The mapping_level function, which is called by tdp_page_fault on the
kernel side, will return PT_DIRECTORY_LEVEL if the
KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag of the memory slot is
set, and PT_PAGE_TABLE_LEVEL if the
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flag is set.

The key steps to split the huge pages in advance of enabling dirty logging
are as follows:
1. The migration thread in user space sets the
KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag for each memory slot.
2. The migration thread then calls the KVM_SPLIT_HUGE_PAGES
ioctl (which is newly added) to split the large pages on the
kernel side.
3. A new vCPU is created temporarily (it is initialized but will not
run) to help do the work, i.e. to serve as the parameter of tdp_page_fault.
4. Collect the GPA ranges of all the memory slots with the
KVM_MEM_FORCE_PT_DIRECTORY_PAGES flag set.
5. Split the 1G huge pages (collected in step 4) into 2M pages by calling
tdp_page_fault, since mapping_level will return
PT_DIRECTORY_LEVEL. The main difference from the usual
path, which is triggered from the guest side (EPT violation/misconfig etc.),
is that we call it directly on the hypervisor side.
6. Do some cleanups, i.e. free the vCPU related resources.
7. The KVM_SPLIT_HUGE_PAGES ioctl returns to user space.
8. Use KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES instead of
KVM_MEM_FORCE_PT_DIRECTORY_PAGES to repeat step 1 ~ step 7;
in step 5 the 2M huge pages will be split into 4K pages.
9. Clear the KVM_MEM_FORCE_PT_DIRECTORY_PAGES and
KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES flags for each memory slot.
10. Then the migration thread calls the log_start ioctl to enable dirty
logging, and the rest is the same.
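
The flow above can be sketched as a small user-space model. This is only
a toy under stated assumptions, not kernel code: the two flags and the
KVM_SPLIT_HUGE_PAGES ioctl are proposals from this mail, the flag bit
positions are arbitrary placeholders, every toy_* name is hypothetical,
and each memory slot is simplified to a single mapping level. The
PT_*_LEVEL values follow the x86 KVM MMU naming (4K, 2M, 1G).

```c
#include <stddef.h>
#include <stdint.h>

/* Proposed memslot flags; the bit positions here are placeholders. */
#define KVM_MEM_FORCE_PT_DIRECTORY_PAGES  (1u << 2)
#define KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES (1u << 3)

/* Mapping levels: 4K, 2M and 1G respectively. */
enum { PT_PAGE_TABLE_LEVEL = 1, PT_DIRECTORY_LEVEL = 2, PT_PDPE_LEVEL = 3 };

/* Toy memslot: the whole slot is mapped at one level. */
struct toy_memslot {
    uint64_t size;   /* bytes, assumed to be a multiple of 1 GiB */
    uint32_t flags;
    int level;
};

/* Models mapping_level: the force flags cap the level that may be used. */
static int toy_mapping_level(const struct toy_memslot *slot)
{
    if (slot->flags & KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES)
        return PT_PAGE_TABLE_LEVEL;
    if ((slot->flags & KVM_MEM_FORCE_PT_DIRECTORY_PAGES) &&
        slot->level > PT_DIRECTORY_LEVEL)
        return PT_DIRECTORY_LEVEL;
    return slot->level;
}

/* Number of mappings needed to cover the slot at its current level. */
static uint64_t toy_mapping_count(const struct toy_memslot *slot)
{
    return slot->size >> (12 + 9 * (slot->level - 1));
}

/* Stands in for steps 2 to 7; returns how many slots were split. */
static size_t toy_split_huge_pages(struct toy_memslot *slots, size_t n)
{
    size_t changed = 0;
    for (size_t i = 0; i < n; i++) {
        int level = toy_mapping_level(&slots[i]);
        if (level != slots[i].level) {
            slots[i].level = level;
            changed++;
        }
    }
    return changed;
}

/* Steps 1 to 9: 1G -> 2M, then 2M -> 4K, then clear both flags. */
static void toy_pre_split(struct toy_memslot *slots, size_t n)
{
    static const uint32_t phases[] = {
        KVM_MEM_FORCE_PT_DIRECTORY_PAGES,
        KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES,
    };
    for (size_t p = 0; p < sizeof(phases) / sizeof(phases[0]); p++) {
        for (size_t i = 0; i < n; i++)
            slots[i].flags |= phases[p];
        toy_split_huge_pages(slots, n);
    }
    for (size_t i = 0; i < n; i++)
        slots[i].flags &= ~(KVM_MEM_FORCE_PT_DIRECTORY_PAGES |
                            KVM_MEM_FORCE_PT_PAGE_TABLE_PAGES);
}
```

In this model a 1 GiB slot goes from 1 mapping to 512 after the first
phase and to 262144 after the second. Dirty logging (step 10) would only
be enabled after toy_pre_split returns, so the vCPUs would not have to
create the 4K SPTEs under the MMU lock during migration.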

What's your take on this? Thanks.

Jay Zhou

> Haven't thought deep into any of them, but seems doable.
> Thanks,
> --
> Peter Xu
