Re: [Qemu-devel] [PATCH 06/10] mm: sys_remap_anon_pages
From: Michael Kerrisk <address@hidden>
Subject: Re: [Qemu-devel] [PATCH 06/10] mm: sys_remap_anon_pages
Date: Fri, 4 Jul 2014 13:30:55 +0200
Hi Andrea,
On Wed, Jul 2, 2014 at 6:50 PM, Andrea Arcangeli <address@hidden> wrote:
> This new syscall will move anon pages across vmas, atomically and
> without touching the vmas.
Please CC linux-api on patches that change the API/ABI. (See
https://www.kernel.org/doc/man-pages/linux-api-ml.html)
Cheers,
Michael
> It only works on non shared anonymous pages because those can be
> relocated without generating non linear anon_vmas in the rmap code.
>
> It is the ideal mechanism to handle userspace page faults. Normally
> the destination vma will have VM_USERFAULT set with
> madvise(MADV_USERFAULT) while the source vma will normally have
> VM_DONTCOPY set with madvise(MADV_DONTFORK).
>
> MADV_DONTFORK set on the source vma prevents remap_anon_pages from
> failing if the process forks during the userland page fault.
>
> The thread that triggers the SIGBUS signal handler by touching an
> unmapped hole in the MADV_USERFAULT region should first receive the
> data belonging to the faulting virtual address into a private area in
> the source vma. The data can come from the network, storage or any
> other I/O device. After the data has been safely received in the
> private area in the source vma, the handler calls remap_anon_pages to
> atomically map the page at the faulting address in the destination
> vma, and finally it returns from the signal handler (the test program
> below demonstrates this flow).
>
> It is an alternative to mremap.
>
> It only works if the vma protection bits are identical in the source
> and destination vma.
>
> It can remap non shared anonymous pages within the same vma too.
>
> If the source virtual memory range has any unmapped holes, or if the
> destination virtual memory range is not a whole unmapped hole,
> remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
> provides a very strict behavior to avoid any chance of memory
> corruption going unnoticed if there are userland race conditions. Only
> one thread should resolve the userland page fault at any given time
> for any given faulting address. This means that if two threads try to
> both call remap_anon_pages on the same destination address at the same
> time, the second thread will get an explicit error from this syscall.
>
> The syscall will return "len" if successful. The syscall however
> can be interrupted by fatal signals or errors. If interrupted it will
> return the number of bytes successfully remapped before the
> interruption if any, or the negative error if none. It will never
> return zero. Either it will return an error or an amount of bytes
> successfully moved. If the retval reports a "short" remap, the
> remap_anon_pages syscall should be repeated by userland with
> src+retval, dst+retval, len-retval if it wants to know about the error
> that interrupted it.
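>
> A minimal userland helper to restart a "short" remap could look like
> the sketch below (illustrative only; it assumes the x86-64 syscall
> number 317 from this patch and the retval semantics just described,
> where the glibc syscall() wrapper returns -1 with errno set on error):
>
>         #include <sys/syscall.h>
>         #include <unistd.h>
>
>         #define SYS_remap_anon_pages 317
>
>         /* Retry until all "len" bytes moved, or a real error (errno set). */
>         static long remap_anon_pages_full(unsigned char *dst,
>                                           unsigned char *src,
>                                           unsigned long len,
>                                           unsigned long flags)
>         {
>                 while (len) {
>                         long ret = syscall(SYS_remap_anon_pages, dst, src,
>                                            len, flags);
>                         if (ret == -1)
>                                 return -1; /* interrupted, nothing moved */
>                         /* "short" remap: skip the moved bytes and retry */
>                         dst += ret;
>                         src += ret;
>                         len -= ret;
>                 }
>                 return 0;
>         }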
>
> The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
> errors from materializing if there are holes in the source virtual
> range that is being remapped. The holes will be accounted as
> successfully remapped in the retval of the syscall. This is mostly
> useful for remapping hugepage-aligned virtual regions without knowing
> whether there are transparent hugepages in the regions or not, while
> avoiding the risk of having to split the hugepmd during the remap.
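>
> For example, a 2MB-aligned range could then be moved in one call as
> below (a sketch assuming the RAP_ALLOW_SRC_HOLES value 1UL<<0 defined
> later in this patch):
>
>         #define RAP_ALLOW_SRC_HOLES (1UL<<0)
>
>         /* dst/src both 2MB aligned: holes in src count as remapped,
>            so a partially faulted-in THP region never forces a pmd
>            split */
>         long ret = syscall(SYS_remap_anon_pages, dst, src, 2*1024*1024,
>                            RAP_ALLOW_SRC_HOLES);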
>
> The main difference from mremap is that, when used to fill holes in
> unmapped anonymous memory vmas (in combination with MADV_USERFAULT),
> remap_anon_pages won't create lots of unmergeable vmas. mremap instead
> would create lots of vmas (because of the non linear vma->vm_pgoff)
> leading to -ENOMEM failures (the number of vmas is limited).
>
> MADV_USERFAULT and remap_anon_pages() can be tested with a program
> like below:
>
> ===
> #define _GNU_SOURCE
> #include <sys/mman.h>
> #include <pthread.h>
> #include <strings.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <errno.h>
> #include <string.h>
> #include <signal.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
>
> #define USE_USERFAULT
> #define THP
>
> #define MADV_USERFAULT 18
>
> #define SIZE (1024*1024*1024)
>
> #define SYS_remap_anon_pages 317
>
> static volatile unsigned char *c, *tmp;
>
> void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
> {
>         unsigned char *addr = info->si_addr;
>         int len = 4096;
>         int ret;
>
>         addr = (unsigned char *) ((unsigned long) addr &
>                                   ~((getpagesize())-1));
> #ifdef THP
>         addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
>         len = 2*1024*1024;
> #endif
>         if (addr >= c && addr < c + SIZE) {
>                 unsigned long offset = addr - c;
>                 ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset,
>                               len, 0);
>                 if (ret != len)
>                         perror("sigbus remap_anon_pages"), exit(1);
>                 //printf("sigbus offset %lu\n", offset);
>                 return;
>         }
>
>         printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
> }
>
> int main()
> {
>         struct sigaction sa;
>         int ret;
>         unsigned long i;
> #ifndef THP
>         /*
>          * Fails with THP due to lack of alignment because of memset
>          * pre-filling the destination
>          */
>         c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (c == MAP_FAILED)
>                 perror("mmap"), exit(1);
>         tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
>                    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
>         if (tmp == MAP_FAILED)
>                 perror("mmap"), exit(1);
> #else
>         ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
>         ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
>         if (ret)
>                 perror("posix_memalign"), exit(1);
> #endif
>         /*
>          * MADV_USERFAULT must run before memset, to avoid THP 2m
>          * faults mapping memory into "tmp", if "tmp" isn't allocated
>          * with hugepage alignment.
>          */
>         if (madvise((void *)c, SIZE, MADV_USERFAULT))
>                 perror("madvise"), exit(1);
>         memset((void *)tmp, 0xaa, SIZE);
>
>         sa.sa_sigaction = userfault_sighandler;
>         sigemptyset(&sa.sa_mask);
>         sa.sa_flags = SA_SIGINFO;
>         sigaction(SIGBUS, &sa, NULL);
>
> #ifndef USE_USERFAULT
>         ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
>         if (ret != SIZE)
>                 perror("remap_anon_pages"), exit(1);
> #endif
>
>         for (i = 0; i < SIZE; i += 4096) {
>                 if ((i/4096) % 2) {
>                         /* exercise read and write MADV_USERFAULT */
>                         c[i+1] = 0xbb;
>                 }
>                 if (c[i] != 0xaa)
>                         printf("error %x offset %lu\n", c[i], i), exit(1);
>         }
>         printf("remap_anon_pages functions correctly\n");
>
>         return 0;
> }
> ===
>
> Signed-off-by: Andrea Arcangeli <address@hidden>
> ---
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> include/linux/huge_mm.h | 7 +
> include/linux/syscalls.h | 4 +
> kernel/sys_ni.c | 1 +
> mm/fremap.c | 477 +++++++++++++++++++++++++++++++++++++++
> mm/huge_memory.c | 110 +++++++++
> 7 files changed, 601 insertions(+)
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..08bc856 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
> 351 i386 sched_setattr sys_sched_setattr
> 352 i386 sched_getattr sys_sched_getattr
> 353 i386 renameat2 sys_renameat2
> +354 i386 remap_anon_pages sys_remap_anon_pages
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..37bd179 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
> 314 common sched_setattr sys_sched_setattr
> 315 common sched_getattr sys_sched_getattr
> 316 common renameat2 sys_renameat2
> +317 common remap_anon_pages sys_remap_anon_pages
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 3a2c57e..9a37dd5 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
> extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, pgprot_t newprot,
> int prot_numa);
> +extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd,
> + pmd_t dst_pmdval,
> + struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma,
> + unsigned long dst_addr,
> + unsigned long src_addr);
>
> enum transparent_hugepage_flag {
> TRANSPARENT_HUGEPAGE_FLAG,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0..19edb00 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -447,6 +447,10 @@ asmlinkage long sys_mremap(unsigned long addr,
> asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> unsigned long prot, unsigned long pgoff,
> unsigned long flags);
> +asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
> + unsigned long src_start,
> + unsigned long len,
> + unsigned long flags);
> asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
> asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
> asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 36441b5..6fc1aca 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -177,6 +177,7 @@ cond_syscall(sys_mincore);
> cond_syscall(sys_madvise);
> cond_syscall(sys_mremap);
> cond_syscall(sys_remap_file_pages);
> +cond_syscall(sys_remap_anon_pages);
> cond_syscall(compat_sys_move_pages);
> cond_syscall(compat_sys_migrate_pages);
>
> diff --git a/mm/fremap.c b/mm/fremap.c
> index 1e509f7..9337637 100644
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
> if (ptl1 != ptl2)
> spin_unlock(ptl2);
> }
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +/*
> + * The mmap_sem for reading is held by the caller. Just move the page
> + * from src_pmd to dst_pmd if possible, and return true if succeeded
> + * in moving the page.
> + */
> +static int remap_anon_pages_pte(struct mm_struct *mm,
> + pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
> + struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma,
> + unsigned long dst_addr,
> + unsigned long src_addr,
> + spinlock_t *dst_ptl,
> + spinlock_t *src_ptl,
> + unsigned long flags)
> +{
> + struct page *src_page;
> + swp_entry_t entry;
> + pte_t orig_src_pte, orig_dst_pte;
> + struct anon_vma *src_anon_vma, *dst_anon_vma;
> +
> + spin_lock(dst_ptl);
> + orig_dst_pte = *dst_pte;
> + spin_unlock(dst_ptl);
> + if (!pte_none(orig_dst_pte))
> + return -EEXIST;
> +
> + spin_lock(src_ptl);
> + orig_src_pte = *src_pte;
> + spin_unlock(src_ptl);
> + if (pte_none(orig_src_pte)) {
> + if (!(flags & RAP_ALLOW_SRC_HOLES))
> + return -ENOENT;
> + else
> + /* nothing to do to remap a hole */
> + return 0;
> + }
> +
> + if (pte_present(orig_src_pte)) {
> + /*
> + * Pin the page while holding the lock to be sure the
> + * page isn't freed under us
> + */
> + spin_lock(src_ptl);
> + if (!pte_same(orig_src_pte, *src_pte)) {
> + spin_unlock(src_ptl);
> + return -EAGAIN;
> + }
> + src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
> + if (!src_page || !PageAnon(src_page) ||
> + page_mapcount(src_page) != 1) {
> + spin_unlock(src_ptl);
> + return -EBUSY;
> + }
> +
> + get_page(src_page);
> + spin_unlock(src_ptl);
> +
> + /* block all concurrent rmap walks */
> + lock_page(src_page);
> +
> + /*
> + * page_referenced_anon walks the anon_vma chain
> + * without the page lock. Serialize against it with
> + * the anon_vma lock, the page lock is not enough.
> + */
> + src_anon_vma = page_get_anon_vma(src_page);
> + if (!src_anon_vma) {
> + /* page was unmapped from under us */
> + unlock_page(src_page);
> + put_page(src_page);
> + return -EAGAIN;
> + }
> + anon_vma_lock_write(src_anon_vma);
> +
> + double_pt_lock(dst_ptl, src_ptl);
> +
> + if (!pte_same(*src_pte, orig_src_pte) ||
> + !pte_same(*dst_pte, orig_dst_pte) ||
> + page_mapcount(src_page) != 1) {
> + double_pt_unlock(dst_ptl, src_ptl);
> + anon_vma_unlock_write(src_anon_vma);
> + put_anon_vma(src_anon_vma);
> + unlock_page(src_page);
> + put_page(src_page);
> + return -EAGAIN;
> + }
> +
> + BUG_ON(!PageAnon(src_page));
> + /* the PT lock is enough to keep the page pinned now */
> + put_page(src_page);
> +
> + dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> + ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
> + dst_anon_vma);
> + ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
> + dst_addr);
> +
> + if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
> + orig_src_pte))
> + BUG();
> +
> + orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
> + orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
> + dst_vma);
> +
> + set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> +
> + double_pt_unlock(dst_ptl, src_ptl);
> +
> + anon_vma_unlock_write(src_anon_vma);
> + put_anon_vma(src_anon_vma);
> +
> + /* unblock rmap walks */
> + unlock_page(src_page);
> +
> + mmu_notifier_invalidate_page(mm, src_addr);
> + } else {
> + if (pte_file(orig_src_pte))
> + return -EFAULT;
> +
> + entry = pte_to_swp_entry(orig_src_pte);
> + if (non_swap_entry(entry)) {
> + if (is_migration_entry(entry)) {
> + migration_entry_wait(mm, src_pmd, src_addr);
> + return -EAGAIN;
> + }
> + return -EFAULT;
> + }
> +
> + if (swp_entry_swapcount(entry) != 1)
> + return -EBUSY;
> +
> + double_pt_lock(dst_ptl, src_ptl);
> +
> + if (!pte_same(*src_pte, orig_src_pte) ||
> + !pte_same(*dst_pte, orig_dst_pte) ||
> + swp_entry_swapcount(entry) != 1) {
> + double_pt_unlock(dst_ptl, src_ptl);
> + return -EAGAIN;
> + }
> +
> + if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
> + pte_val(orig_src_pte))
> + BUG();
> + set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> +
> + double_pt_unlock(dst_ptl, src_ptl);
> + }
> +
> + return 0;
> +}
> +
> +static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd = NULL;
> +
> + pgd = pgd_offset(mm, address);
> + pud = pud_alloc(mm, pgd, address);
> + if (pud)
> + /*
> + * Note that this doesn't run only when the pmd was
> + * missing: the *pmd may be already established and in
> + * turn it may also be a trans_huge_pmd.
> + */
> + pmd = pmd_alloc(mm, pud, address);
> + return pmd;
> +}
> +
> +/**
> + * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
> + * @dst_start: start of the destination virtual memory range
> + * @src_start: start of the source virtual memory range
> + * @len: length of the virtual memory range
> + *
> + * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
> + * zero copy. It only works on non shared anonymous pages because
> + * those can be relocated without generating non linear anon_vmas in
> + * the rmap code.
> + *
> + * It is the ideal mechanism to handle userspace page faults. Normally
> + * the destination vma will have VM_USERFAULT set with
> + * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
> + * set with madvise(MADV_DONTFORK).
> + *
> + * The thread receiving the page during the userland page fault
> + * (MADV_USERFAULT) will receive the faulting page in the source vma
> + * through the network, storage or any other I/O device (MADV_DONTFORK
> + * in the source vma prevents remap_anon_pages from failing with -EBUSY if
> + * the process forks before remap_anon_pages is called), then it will
> + * call remap_anon_pages to map the page in the faulting address in
> + * the destination vma.
> + *
> + * This syscall works purely via pagetables, so it's the most
> + * efficient way to move physical non shared anonymous pages across
> + * different virtual addresses. Unlike mremap()/mmap()/munmap() it
> + * does not create any new vmas. The mapping in the destination
> + * address is atomic.
> + *
> + * It only works if the vma protection bits are identical in the
> + * source and destination vma.
> + *
> + * It can remap non shared anonymous pages within the same vma too.
> + *
> + * If the source virtual memory range has any unmapped holes, or if
> + * the destination virtual memory range is not a whole unmapped hole,
> + * remap_anon_pages will fail respectively with -ENOENT or
> + * -EEXIST. This provides a very strict behavior to avoid any chance
> + * of memory corruption going unnoticed if there are userland race
> + * conditions. Only one thread should resolve the userland page fault
> + * at any given time for any given faulting address. This means that
> + * if two threads try to both call remap_anon_pages on the same
> + * destination address at the same time, the second thread will get an
> + * explicit error from this syscall.
> + *
> + * The syscall will return "len" if successful. The syscall
> + * however can be interrupted by fatal signals or errors. If
> + * interrupted it will return the number of bytes successfully
> + * remapped before the interruption if any, or the negative error if
> + * none. It will never return zero. Either it will return an error or
> + * an amount of bytes successfully moved. If the retval reports a
> + * "short" remap, the remap_anon_pages syscall should be repeated by
> + * userland with src+retval, dst+retval, len-retval if it wants to know
> + * about the error that interrupted it.
> + *
> + * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
> + * errors from materializing if there are holes in the source virtual
> + * range that is being remapped. The holes will be accounted as
> + * successfully remapped in the retval of the syscall. This is mostly
> + * useful for remapping hugepage-aligned virtual regions without
> + * knowing whether there are transparent hugepages in the regions or
> + * not, while avoiding the risk of having to split the hugepmd during
> + * the remap.
> + *
> + * If there's any rmap walk that is taking the anon_vma locks without
> + * first obtaining the page lock (for example split_huge_page and
> + * page_referenced_anon), they will have to verify if the
> + * page->mapping has changed after taking the anon_vma lock. If it
> + * changed they should release the lock and retry obtaining a new
> + * anon_vma, because it means the anon_vma was changed by
> + * remap_anon_pages before the lock could be obtained. This is the
> + * only additional complexity added to the rmap code to provide this
> + * anonymous page remapping functionality.
> + */
> +SYSCALL_DEFINE4(remap_anon_pages,
> + unsigned long, dst_start, unsigned long, src_start,
> + unsigned long, len, unsigned long, flags)
> +{
> + struct mm_struct *mm = current->mm;
> + struct vm_area_struct *src_vma, *dst_vma;
> + long err = -EINVAL;
> + pmd_t *src_pmd, *dst_pmd;
> + pte_t *src_pte, *dst_pte;
> + spinlock_t *dst_ptl, *src_ptl;
> + unsigned long src_addr, dst_addr;
> + int thp_aligned = -1;
> + long moved = 0;
> +
> + /*
> + * Sanitize the syscall parameters:
> + */
> + if (src_start & ~PAGE_MASK)
> + return err;
> + if (dst_start & ~PAGE_MASK)
> + return err;
> + if (len & ~PAGE_MASK)
> + return err;
> + if (flags & ~RAP_ALLOW_SRC_HOLES)
> + return err;
> +
> + /* Does the address range wrap, or is the span zero-sized? */
> + if (unlikely(src_start + len <= src_start))
> + return err;
> + if (unlikely(dst_start + len <= dst_start))
> + return err;
> +
> + down_read(&mm->mmap_sem);
> +
> + /*
> + * Make sure the vma is not shared, that the src and dst remap
> + * ranges are both valid and fully within a single existing
> + * vma.
> + */
> + src_vma = find_vma(mm, src_start);
> + if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> + goto out;
> + if (src_start < src_vma->vm_start ||
> + src_start + len > src_vma->vm_end)
> + goto out;
> +
> + dst_vma = find_vma(mm, dst_start);
> + if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> + goto out;
> + if (dst_start < dst_vma->vm_start ||
> + dst_start + len > dst_vma->vm_end)
> + goto out;
> +
> + if (pgprot_val(src_vma->vm_page_prot) !=
> + pgprot_val(dst_vma->vm_page_prot))
> + goto out;
> +
> + /* only allow remapping if both are mlocked or both aren't */
> + if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
> + goto out;
> +
> + /*
> + * Ensure the dst_vma has an anon_vma or this page
> + * would get a NULL anon_vma when moved in the
> + * dst_vma.
> + */
> + err = -ENOMEM;
> + if (unlikely(anon_vma_prepare(dst_vma)))
> + goto out;
> +
> + for (src_addr = src_start, dst_addr = dst_start;
> + src_addr < src_start + len; ) {
> + spinlock_t *ptl;
> + pmd_t dst_pmdval;
> + BUG_ON(dst_addr >= dst_start + len);
> + src_pmd = mm_find_pmd(mm, src_addr);
> + if (unlikely(!src_pmd)) {
> + if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> + err = -ENOENT;
> + break;
> + } else {
> + src_pmd = mm_alloc_pmd(mm, src_addr);
> + if (unlikely(!src_pmd)) {
> + err = -ENOMEM;
> + break;
> + }
> + }
> + }
> + dst_pmd = mm_alloc_pmd(mm, dst_addr);
> + if (unlikely(!dst_pmd)) {
> + err = -ENOMEM;
> + break;
> + }
> +
> + dst_pmdval = pmd_read_atomic(dst_pmd);
> + /*
> + * If the dst_pmd is mapped as THP don't
> + * override it and just be strict.
> + */
> + if (unlikely(pmd_trans_huge(dst_pmdval))) {
> + err = -EEXIST;
> + break;
> + }
> + if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
> + /*
> + * Check if we can move the pmd without
> + * splitting it. First check the address
> + * alignment to be the same in src/dst. These
> + * checks don't actually need the PT lock but
> + * it's good to do it here to optimize this
> + * block away at build time if
> + * CONFIG_TRANSPARENT_HUGEPAGE is not set.
> + */
> + if (thp_aligned == -1)
> + thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
> + (dst_addr & ~HPAGE_PMD_MASK));
> + if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
> + !pmd_none(dst_pmdval) ||
> + src_start + len - src_addr < HPAGE_PMD_SIZE) {
> + spin_unlock(ptl);
> + /* Fall through */
> + split_huge_page_pmd(src_vma, src_addr,
> + src_pmd);
> + } else {
> + BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
> + err = remap_anon_pages_huge_pmd(mm,
> + dst_pmd,
> + src_pmd,
> + dst_pmdval,
> + dst_vma,
> + src_vma,
> + dst_addr,
> + src_addr);
> + cond_resched();
> +
> + if (!err) {
> + dst_addr += HPAGE_PMD_SIZE;
> + src_addr += HPAGE_PMD_SIZE;
> + moved += HPAGE_PMD_SIZE;
> + }
> +
> + if ((!err || err == -EAGAIN) &&
> + fatal_signal_pending(current))
> + err = -EINTR;
> +
> + if (err && err != -EAGAIN)
> + break;
> +
> + continue;
> + }
> + }
> +
> + if (pmd_none(*src_pmd)) {
> + if (!(flags & RAP_ALLOW_SRC_HOLES)) {
> + err = -ENOENT;
> + break;
> + } else {
> + if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
> + src_addr))) {
> + err = -ENOMEM;
> + break;
> + }
> + }
> + }
> +
> + /*
> + * We hold the mmap_sem only for reading, so MADV_DONTNEED
> + * can zap transparent huge pages under us, or the
> + * transparent huge page fault can establish new
> + * transparent huge pages under us.
> + */
> + if (unlikely(pmd_trans_unstable(src_pmd))) {
> + err = -EFAULT;
> + break;
> + }
> +
> + if (unlikely(pmd_none(dst_pmdval)) &&
> + unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
> + dst_addr))) {
> + err = -ENOMEM;
> + break;
> + }
> + /* If a huge pmd materialized under us, fail */
> + if (unlikely(pmd_trans_huge(*dst_pmd))) {
> + err = -EFAULT;
> + break;
> + }
> +
> + BUG_ON(pmd_none(*dst_pmd));
> + BUG_ON(pmd_none(*src_pmd));
> + BUG_ON(pmd_trans_huge(*dst_pmd));
> + BUG_ON(pmd_trans_huge(*src_pmd));
> +
> + dst_pte = pte_offset_map(dst_pmd, dst_addr);
> + src_pte = pte_offset_map(src_pmd, src_addr);
> + dst_ptl = pte_lockptr(mm, dst_pmd);
> + src_ptl = pte_lockptr(mm, src_pmd);
> +
> + err = remap_anon_pages_pte(mm,
> + dst_pte, src_pte, src_pmd,
> + dst_vma, src_vma,
> + dst_addr, src_addr,
> + dst_ptl, src_ptl, flags);
> +
> + pte_unmap(dst_pte);
> + pte_unmap(src_pte);
> + cond_resched();
> +
> + if (!err) {
> + dst_addr += PAGE_SIZE;
> + src_addr += PAGE_SIZE;
> + moved += PAGE_SIZE;
> + }
> +
> + if ((!err || err == -EAGAIN) &&
> + fatal_signal_pending(current))
> + err = -EINTR;
> +
> + if (err && err != -EAGAIN)
> + break;
> + }
> +
> +out:
> + up_read(&mm->mmap_sem);
> + BUG_ON(moved < 0);
> + BUG_ON(err > 0);
> + BUG_ON(!moved && !err);
> + return moved ? moved : err;
> +}
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94c37ca..e24cd7c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1541,6 +1541,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> }
>
> /*
> + * The PT lock for src_pmd and the mmap_sem for reading are held by
> + * the caller, but it must return after releasing the
> + * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
> + * until the PT lock of the src_pmd is released. Just move the page
> + * from src_pmd to dst_pmd if possible. Return zero if succeeded in
> + * moving the page, -EAGAIN if it needs to be repeated by the caller,
> + * or other errors in case of failure.
> + */
> +int remap_anon_pages_huge_pmd(struct mm_struct *mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd,
> + pmd_t dst_pmdval,
> + struct vm_area_struct *dst_vma,
> + struct vm_area_struct *src_vma,
> + unsigned long dst_addr,
> + unsigned long src_addr)
> +{
> + pmd_t _dst_pmd, src_pmdval;
> + struct page *src_page;
> + struct anon_vma *src_anon_vma, *dst_anon_vma;
> + spinlock_t *src_ptl, *dst_ptl;
> + pgtable_t pgtable;
> +
> + src_pmdval = *src_pmd;
> + src_ptl = pmd_lockptr(mm, src_pmd);
> +
> + BUG_ON(!pmd_trans_huge(src_pmdval));
> + BUG_ON(pmd_trans_splitting(src_pmdval));
> + BUG_ON(!pmd_none(dst_pmdval));
> + BUG_ON(!spin_is_locked(src_ptl));
> + BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> + src_page = pmd_page(src_pmdval);
> + BUG_ON(!PageHead(src_page));
> + BUG_ON(!PageAnon(src_page));
> + if (unlikely(page_mapcount(src_page) != 1)) {
> + spin_unlock(src_ptl);
> + return -EBUSY;
> + }
> +
> + get_page(src_page);
> + spin_unlock(src_ptl);
> +
> + mmu_notifier_invalidate_range_start(mm, src_addr,
> + src_addr + HPAGE_PMD_SIZE);
> +
> + /* block all concurrent rmap walks */
> + lock_page(src_page);
> +
> + /*
> + * split_huge_page walks the anon_vma chain without the page
> + * lock. Serialize against it with the anon_vma lock, the page
> + * lock is not enough.
> + */
> + src_anon_vma = page_get_anon_vma(src_page);
> + if (!src_anon_vma) {
> + unlock_page(src_page);
> + put_page(src_page);
> + mmu_notifier_invalidate_range_end(mm, src_addr,
> + src_addr + HPAGE_PMD_SIZE);
> + return -EAGAIN;
> + }
> + anon_vma_lock_write(src_anon_vma);
> +
> + dst_ptl = pmd_lockptr(mm, dst_pmd);
> + double_pt_lock(src_ptl, dst_ptl);
> + if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> + !pmd_same(*dst_pmd, dst_pmdval) ||
> + page_mapcount(src_page) != 1)) {
> + double_pt_unlock(src_ptl, dst_ptl);
> + anon_vma_unlock_write(src_anon_vma);
> + put_anon_vma(src_anon_vma);
> + unlock_page(src_page);
> + put_page(src_page);
> + mmu_notifier_invalidate_range_end(mm, src_addr,
> + src_addr + HPAGE_PMD_SIZE);
> + return -EAGAIN;
> + }
> +
> + BUG_ON(!PageHead(src_page));
> + BUG_ON(!PageAnon(src_page));
> + /* the PT lock is enough to keep the page pinned now */
> + put_page(src_page);
> +
> + dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
> + ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
> + ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
> +
> + if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
> + src_pmdval))
> + BUG();
> + _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
> + _dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
> + set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
> +
> + pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
> + pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
> + double_pt_unlock(src_ptl, dst_ptl);
> +
> + anon_vma_unlock_write(src_anon_vma);
> + put_anon_vma(src_anon_vma);
> +
> + /* unblock rmap walks */
> + unlock_page(src_page);
> +
> + mmu_notifier_invalidate_range_end(mm, src_addr,
> + src_addr + HPAGE_PMD_SIZE);
> + return 0;
> +}
> +
> +/*
> * Returns 1 if a given pmd maps a stable (not under splitting) thp.
> * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
> *
>
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/