Re: [Qemu-devel] race between tb_gen_code() and qemu_ram_free()
From: Paolo Bonzini
Subject: Re: [Qemu-devel] race between tb_gen_code() and qemu_ram_free()
Date: Fri, 20 Apr 2018 20:20:05 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0
On 20/04/2018 19:55, Peter Maydell wrote:
> There seems to be a race between tb_gen_code() and qemu_ram_free(),
> which results in an abort() in Edgar's test case that exercises the
> xilinx-spips mmio-exec functionality.
>
> Here's what happens:
> (1) memory_region_invalidate_mmio_ptr() is called, and it deletes
^^
This "it" is more precisely memory_region_do_invalidate_mmio_ptr.
> the temporary ram MemoryRegion. This doesn't immediately result in
> deletion of the RAMBlock (which is done from the RCU thread later)
> (2) we try to execute from this physaddr, so tb_gen_code() is called.
> This calls get_page_addr_code(). Since the RAMBlock hasn't yet
> really been deleted, get_page_addr_code() finds it, and returns a
> ram_addr in the about-to-go-away memory...
> (3) while the CPU thread is busy doing codegen, the RCU thread runs.
> It does a flatview_destroy() which ends up doing an object_unref on the
> memory region, whose destructor calls qemu_ram_free(), removing the
> RAMBlock from the ram list
> (4) when the CPU thread has done its codegen, it calls tb_link_page,
> which calls tb_alloc_page, which calls tlb_protect_code(), with the
> ram_addr that it got in step 2. We end up in tlb_reset_dirty_range_all(),
> which calls qemu_get_ram_block() on that ram_addr.
> (5) qemu_get_ram_block() aborts, because it can't find the RAMBlock
> corresponding to the ram_addr (because we removed it in step 3).
>
> How should we resolve this race ?
Note that qemu_ram_free() is _also_ RCU-freeing the RAMBlock. If the
RAMBlock is not found, it means that codegen is not running within
rcu_read_lock()/rcu_read_unlock(). In fact it's not.
So one possibility would be to add an rcu_read_lock()/rcu_read_unlock()
pair around code generation. Theoretically the simplest fix, but
rcu_read_lock() would then be taken _outside_ tb_lock, which may be a
problem.
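A minimal sketch of that first option (illustrative only: tb_gen_code()'s
real parameter list is elided, and whether this lock ordering against
tb_lock is actually safe is precisely the open question):

```
/* accel/tcg/translate-all.c, sketch only */
TranslationBlock *tb_gen_code(CPUState *cpu, ...)
{
    TranslationBlock *tb;

    rcu_read_lock();    /* pin the RAMBlock for the whole translation */
    /* ... existing get_page_addr_code() + codegen + tb_link_page() ... */
    rcu_read_unlock();  /* only now may qemu_ram_free()'s RCU callback run */
    return tb;
}
```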
The alternative is to add a ref/unref pair on the MemoryRegion whenever
get_page_addr_code() uses it. That would delay the MemoryRegion
destructor until after the end of code generation. A rough sketch would be
that get_page_addr_code() would grow a MemoryRegion** argument and would
use memory_region_from_host() instead of qemu_ram_addr_from_host().
get_page_addr_code() would either call rcu_read_lock() itself, or the
callers would (see the doc comment for memory_region_from_host(); it
always feels good when you can refer to doc comments!).
Thanks,
Paolo
> (PS: rr is really useful for identifying what's actually happening
> in multi-thread races like this...)
>
> thanks
> -- PMM
>