Re: [Qemu-devel] softmmu thoughts
From: Piotras
Subject: Re: [Qemu-devel] softmmu thoughts
Date: Wed, 20 Oct 2004 02:13:24 +0200
Hi!
I have already experimented with a similar approach. I started with qemu-fast,
as it already uses a signal handler and mmap to set up the guest address space.
Qemu-fast requires that the virtual address space visible inside the emulator
is mapped directly into the qemu process address space, which is why it uses a
special memory layout. This means it will always be much less portable than
qemu-softmmu, but on the other hand it is much faster and can support code-copy
to achieve near-native performance.
The goal of my experiment was to evaluate the possibility of using mmap-ed
memory to improve the speed of softmmu without introducing the portability
limitations of qemu-fast. For this I used an indirection table, and the memory
access code was very similar to yours:
TYPE mem_read (uint32_t virtual_addr)
{
    uint32_t entry;
    uint32_t physical_addr;

    entry = virtual_addr >> MAP_BLOCK_BITS;
    /* the entries in indirection_table compensate for the higher bits of
       virtual_addr to avoid an extra "and" operation */
    physical_addr = CPUState->indirection_table[entry] + virtual_addr;
    return *(TYPE *)physical_addr;
}
Each indirection_table entry points to a block of 2^(MAP_BLOCK_BITS - 12) + 1
pages of virtual memory. Each block contains pages that should be accessible
at contiguous virtual addresses. Because of this (and the +1 in the formula
above), a memory access that crosses a page boundary still runs at full speed.
I use a pool of blocks that is much smaller than the guest virtual address
space, and a special block of inaccessible memory to trap memory accesses via
entries of indirection_table that are not mapped to a valid block. If I need
to allocate a new block and my pool is empty, I unmap the least-recently
allocated block.
The memory access is implemented in 4 x86 instructions:
asm volatile (
    "mov  %3, %%eax\n"
    "shr  %2, %%eax\n"
    "mov  %1(%%ebp,%%eax,4), %%eax\n"
    "movl (%3,%%eax,1), %0\n"
    : "=r" (result)
    : "m" (*(uint8_t *)offsetof(CPUX86State, indirection_table[0])),
      "I" (MAP_BLOCK_BITS),
      "r" (virtual_addr)
    : "%eax");
Only the last instruction can fault. When the signal handler modifies an
indirection_table entry, it also stores the new value in the saved EAX
register, so the new mapping is used when the faulting instruction is
restarted.
I believe that MAP_BLOCK_BITS should be set to a value larger than 12 to limit
the size of the indirection_table and the fragmentation of the memory map
(mmaped pages).
IIRC, the nbench results were about 20-30% better than with traditional
qemu-softmmu. Linux also seemed faster. However, Windows 98 seemed much
slower. The problem with Windows is that it does a lot of writes to the very
same pages the code is executing from, and this causes a lot of page faults.
I'm attaching the patch. It's very experimental, so you should expect bugs.
Piotrek
On Tue, 19 Oct 2004 22:27:57 +0200, Magnus Damm <address@hidden> wrote:
> Hello all,
>
> Wouldn't it be possible to speed up the softmmu code by using some
> mmap() tricks?
>
> u_int32_t mem_read(u_int32_t address)
> {
>     u_int8_t entry;
>     u_int32_t a;
>
>     entry = CPUState->softmmu_lookup[address >> 12];
>     a = CPUState->softmmu_entries[entry].base + (address & 0xfff);
>     return *(u_int32_t *)a;
> }
>
> The idea is to optimize so that the most common memory accesses become
> faster than today, while the more uncommon ones (crossing a page boundary)
> will generate a signal and thus become slower. If I remember correctly, the
> code above is around 7 x86 instructions long.
>
> The code above will use 1 MiB of memory for softmmu_lookup, one byte per
> entry. A value of 0 means "not mapped", and softmmu_entries[0] will always
> point to a page that generates a signal. The other 255 entries are used to
> map one virtual address to the base address of a two-page combination
> somewhere in memory. This two-page combination is actually two VMAs, where
> the first page maps to the correct simulated physical address. The second
> page is mapped as inaccessible and is used to generate a signal when a
> memory access crosses the page boundary.
>
> And of course there are many more things that must be done, including a
> complicated signal handler, and I guess this kind of implementation is not
> really useful for mapping memory-mapped I/O. But maybe it is efficient for
> userspace?
>
> Any thoughts?
>
> / magnus