Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements

From:	Richard Henderson
Subject:	Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
Date:	Mon, 27 Mar 2017 20:57:32 +1000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 03/26/2017 02:52 AM, Pranith Kumar wrote:

Hello,

With MTTCG code now merged in mainline, I tried to see if we are able to run
x86 SMP guests on ARM64 hosts. For this I tried running a Windows XP guest on
a DragonBoard 410c which has 1GB RAM. Since x86 has a strong memory model
whereas ARM64 has a weak memory model, I added a patch to generate fence
instructions for every guest memory access. After some minor fixes, I was
successfully able to boot a 4 core guest all the way to the desktop (albeit
with a 1GB backing swap). However the performance is severely
limited and the guest is barely usable. Based on my observations, I think
there are some easily implementable additions we can make to improve the
performance of TCG in general and on ARM64 in particular. I propose to do the
following as part of Google Summer of Code 2017.


* Implement jump-to-register instruction on ARM64 to overcome the 128MB
  translation cache size limit.

  The translation cache size for an ARM64 host is currently limited to 128
  MB. This limitation is imposed by utilizing a branch instruction which
  encodes the jump offset and is limited by the number of bits it can use for
  the range of the offset. The performance impact of this limitation is severe
  and can be observed when you try to run large programs like a browser in the
  guest. The cache is flushed several times before the browser starts and the
  performance is not satisfactory. This limitation can be overcome by
  generating a branch-to-register instruction and utilizing that when the
  destination address is outside the range of what can be encoded in the
  current branch instruction.

128MB is really quite large. I doubt doubling the cache size will really help
that much. That said, it's really quite trivial to make this change, if you'd
like to experiment.

FWIW, I rarely see TB flushes for alpha -- not one during an entire gcc
bootstrap. Now, this is usually with 4GB RAM, which by default implies a 512MB
translation cache. But it does mean that, given an ideal guest, TB flushes
should not dominate anything at all.

If you're seeing multiple flushes during the startup of a browser, your guest
must be flushing for other reasons than the code_gen_buffer being full.
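For reference, the 128MB figure follows from the A64 B encoding, which holds a
signed 26-bit word offset, so a direct branch reaches +/-128MB around itself.
A minimal Python sketch of that range check is below. The function names are
illustrative only and are not QEMU's; the two test addresses are taken from
the exit_tb listings later in this mail.

```python
B_OPCODE = 0x14000000   # A64 unconditional branch (B); imm26 lives in bits 25:0
B_RANGE = 1 << 27       # imm26 is a signed word offset: +/-128MB in bytes


def b_in_range(pc, target):
    """Return True if a single B placed at `pc` can reach `target`."""
    offset = target - pc
    return offset % 4 == 0 and -B_RANGE <= offset < B_RANGE


def encode_b(pc, target):
    """Encode B from `pc` to `target`; raise if the offset does not fit."""
    if not b_in_range(pc, target):
        raise ValueError("out of B range; an indirect branch would be needed")
    return B_OPCODE | (((target - pc) >> 2) & 0x03FFFFFF)


# The two branches shown in the listings further down.
print(hex(encode_b(0x7f7515211c, 0x7f75152028)))   # 0x17ffffc3
print(hex(encode_b(0x0056950c, 0x00569488)))       # 0x17ffffdf
```

Anything for which b_in_range() is false is where a branch-to-register
sequence would have to be emitted instead.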

* Implement an LRU translation block code cache.

  In the current TCG design, when the translation cache fills up, we flush all
  the translated blocks (TBs) to free up space. We can improve this situation
  by not flushing the TBs that were recently used i.e., by implementing an LRU
  policy for freeing the blocks. This should avoid the re-translation overhead
  for frequently used blocks and improve performance.


The major problem you'll encounter is how to manage allocation in this case.

With the current mechanism, it does not matter that we do not know how much
code is going to be generated for a given set of TCG opcodes. When we reach
the high-water mark, we've run out of room. We then flush everything and start
over at the beginning of the buffer.
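As a rough illustration of that scheme, here is a toy Python model of a
bump-allocated buffer with a high-water mark. It is a sketch of the idea only:
the class name, the max_tb_size parameter and the exact point at which the
check happens are assumptions, not a transcription of QEMU's code.

```python
class CodeGenBuffer:
    """Toy model of a bump-allocated code buffer with a high-water mark."""

    def __init__(self, size, max_tb_size):
        self.max_tb_size = max_tb_size
        # Keep room for one worst-case TB past the mark, so a translation
        # that starts below the mark can never overrun the buffer.
        self.highwater = size - max_tb_size
        self.ptr = 0
        self.flushes = 0

    def translate(self, emitted_len):
        """Emit one TB of `emitted_len` bytes and return its start offset."""
        assert emitted_len <= self.max_tb_size
        if self.ptr >= self.highwater:
            # Out of room: drop every TB and restart at the beginning.
            self.ptr = 0
            self.flushes += 1
        start = self.ptr
        self.ptr += emitted_len   # the size is only known after emission
        return start


buf = CodeGenBuffer(size=1024, max_tb_size=128)
offsets = [buf.translate(100) for _ in range(12)]
print(offsets, buf.flushes)
```

Note that translate() never needs the length up front; it only advances the
pointer afterwards, which is exactly what an allocator-based design gives up.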

If you manage the cache with an allocator, you'll need to know in advance how
much code is going to be generated. This is going to require that you either
(1) severely over-estimate the space required (qemu_ld generates lots more code
than just add), (2) severely increase the time required, by generating code
twice, or (3) somewhat increase the time required, by generating
position-independent code into an external buffer and copying it into place
after determining the size.
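Option (3) could look roughly like the following Python sketch: emit into a
scratch buffer, measure, then ask a first-fit allocator for exactly that many
bytes and copy. Everything here (ExactFitCache, the hole list, the emit
callback) is hypothetical, and the sketch leaves out eviction entirely.

```python
class ExactFitCache:
    """Toy cache: code is emitted into a scratch buffer, then copied in."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.free = [(0, size)]   # list of (offset, length) holes

    def alloc(self, length):
        """First-fit allocation; return an offset, or None if nothing fits."""
        for i, (off, hole) in enumerate(self.free):
            if hole >= length:
                if hole == length:
                    del self.free[i]
                else:
                    self.free[i] = (off + length, hole - length)
                return off
        return None   # the caller would have to evict (e.g. LRU) and retry

    def install(self, emit):
        """Run `emit(scratch)`, then place the result at an exact-fit slot."""
        scratch = bytearray()
        emit(scratch)             # the size is unknown until this returns
        off = self.alloc(len(scratch))
        if off is not None:
            # Copying is only valid if the code is position-independent.
            self.buf[off:off + len(scratch)] = scratch
        return off


NOP = bytes.fromhex("1f2003d5")   # A64 NOP (0xd503201f), little-endian
cache = ExactFitCache(16)
first = cache.install(lambda out: out.extend(NOP))
second = cache.install(lambda out: out.extend(NOP * 2))
third = cache.install(lambda out: out.extend(NOP * 2))
print(first, second, third)
```

The extra cost relative to the bump scheme is the copy plus the constraint
that emitted code must not depend on its final address.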

* Avoid consistency overhead for strong memory model guests by generating
  load-acquire and store-release instructions.

This is probably required for good performance of the user-only code path, but
considering the number of other insns required for the system tlb lookup, I'm
surprised that the memory barrier matters.

Please let me know if you have any comments or suggestions. Also please let me
know if there are other enhancements that are easily implementable to increase
TCG performance as part of this project or otherwise.

I think it would be interesting to place TranslationBlock structures into the
same memory block as code_gen_buffer, immediately before the code that
implements the TB.


Consider what happens within every TB:

(1) We have one or more references to the TB address, via exit_tb.

For aarch64, this will normally require 2-4 insns.

# alpha-softmmu
0x7f75152114:  d0ffb320      adrp x0, #-0x99a000 (addr 0x7f747b8000)
0x7f75152118:  91004c00      add x0, x0, #0x13 (19)
0x7f7515211c:  17ffffc3      b #-0xf4 (addr 0x7f75152028)

# alpha-linux-user
0x00569500:  d2800260      mov x0, #0x13
0x00569504:  f2b59820      movk x0, #0xacc1, lsl #16
0x00569508:  f2c00fe0      movk x0, #0x7f, lsl #32
0x0056950c:  17ffffdf      b #-0x84 (addr 0x569488)

We would reduce this to one insn, always, if the TB were close by, since the
ADR instruction has a range of 1MB.
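A small Python sketch of the two addressing forms involved, assuming the
standard A64 encodings for ADR, ADRP and ADD (immediate). The helper names are
made up for illustration. The ADRP+ADD helper reproduces the alpha-softmmu
listing; the ADR helper shows the single instruction that becomes possible
once the TB sits within 1MB of the code.

```python
def encode_adr(rd, pc, target):
    """ADR Xd, target: pc-relative, signed 21-bit byte offset (+/-1MB)."""
    offset = target - pc
    if not -(1 << 20) <= offset < (1 << 20):
        raise ValueError("target outside the +/-1MB ADR range")
    imm = offset & 0x1FFFFF
    return 0x10000000 | ((imm & 3) << 29) | ((imm >> 2) << 5) | rd


def encode_adrp_add(rd, pc, target):
    """ADRP+ADD pair: 4KB page delta, then the low 12 bits of the target."""
    pages = (target >> 12) - (pc >> 12)
    if not -(1 << 20) <= pages < (1 << 20):
        raise ValueError("target outside the ADRP range")
    pages &= 0x1FFFFF
    adrp = 0x90000000 | ((pages & 3) << 29) | ((pages >> 2) << 5) | rd
    add = 0x91000000 | ((target & 0xFFF) << 10) | (rd << 5) | rd
    return adrp, add


# The softmmu listing: the TB reference is about 9.6MB away, so ADR cannot
# reach it and two instructions are needed.
adrp, add = encode_adrp_add(0, 0x7f75152114, 0x7f747b8013)
print(f"{adrp:08x} {add:08x}")   # d0ffb320 91004c00

# With the TB placed just before its code, one ADR is enough.
print(f"{encode_adr(0, 0x1080, 0x1003):08x}")
```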



(2) We have zero to two references to a linked TB, via goto_tb.

Your stated goal above for eliminating the code_gen_buffer maximum of 128MB can
be done in two ways.

(2A) Raise the maximum to 2GB. For this we would align an instruction pair,
adrp+add, to compute the address; the following insn would branch. The update
code would write a new destination by modifying the adrp+add with a single
64-bit store.
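A sketch of what the (2A) update could look like, in Python for brevity. It
packs the adrp+add pair into one little-endian 64-bit value so that re-linking
is a single aligned 8-byte write. The helper names are hypothetical, and the
sketch only checks ADRP's architectural range, not the 2GB buffer cap.

```python
import struct


def adrp_add_word(rd, pc, target):
    """Pack ADRP+ADD for `target` into one little-endian 64-bit value."""
    if pc % 8:
        raise ValueError("pair must be 8-byte aligned for a single store")
    pages = (target >> 12) - (pc >> 12)
    if not -(1 << 20) <= pages < (1 << 20):
        raise ValueError("target outside the ADRP range")
    pages &= 0x1FFFFF
    adrp = 0x90000000 | ((pages & 3) << 29) | ((pages >> 2) << 5) | rd
    add = 0x91000000 | ((target & 0xFFF) << 10) | (rd << 5) | rd
    return adrp | (add << 32)   # ADRP sits at the lower address


def patch_jump(code, offset, base, rd, target):
    """Retarget the pair at code[offset:offset + 8] with one 8-byte write."""
    word = adrp_add_word(rd, base + offset, target)
    struct.pack_into("<Q", code, offset, word)


code = bytearray(16)
patch_jump(code, 8, base=0x7f75152108, rd=0, target=0x7f747b8013)
print(code[8:16].hex())
```

The alignment check matters: if the pair straddled an 8-byte boundary, another
thread could observe a new adrp with an old add.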

(2B) Eliminate the maximum altogether by referencing the destination directly
in the TB. This is the !USE_DIRECT_JUMP path. It is normally not used on
64-bit targets because computing the full 64-bit address of the TB is harder,
or just as hard, as computing the full 64-bit address of the destination.

However, if the TB is nearby, aarch64 can load the address from
TB.jmp_target_addr in one insn, with LDR (literal). This pc-relative load also
has a 1MB range.
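For illustration, a Python sketch of the LDR (literal) encoding, assuming the
standard 64-bit form with a signed 19-bit word offset. The function name and
the example addresses are made up; in the co-allocated layout the literal
would be the jump-target slot inside the TB that precedes the code.

```python
def encode_ldr_literal(rt, pc, addr):
    """LDR Xt, <label>: 64-bit pc-relative load, signed 19-bit word offset."""
    offset = addr - pc
    if offset % 4 or not -(1 << 20) <= offset < (1 << 20):
        raise ValueError("literal must be word-aligned and within +/-1MB")
    return 0x58000000 | (((offset >> 2) & 0x7FFFF) << 5) | rt


# Load a slot that sits 0x40 bytes before the instruction into x30.
print(f"{encode_ldr_literal(30, 0x2000, 0x1fc0):08x}")
```

Re-linking then only rewrites the 8-byte data slot, so no instruction bytes
change.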

This has the side benefit that it is much quicker to re-link TBs, both in the
computation of the code for the destination as well as re-flushing the icache.

In addition, I strongly suspect the 1,342,177 entries (153MB) that we currently
allocate for tcg_ctx.tb_ctx.tbs, given a 512MB code_gen_buffer, is excessive.
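The per-TB budgets implied by those figures can be worked out directly. The
Python snippet below only does arithmetic on the numbers quoted above,
assuming that MB means 2^20 bytes and that 153MB is a rounded value.

```python
MB = 1 << 20
code_gen_buffer_size = 512 * MB
max_tbs = 1_342_177           # entries quoted above
tbs_array_size = 153 * MB     # size quoted above (rounded)

avg_code_per_tb = code_gen_buffer_size / max_tbs   # about 400 bytes of code
bytes_per_entry = tbs_array_size / max_tbs         # about 120 bytes per TB
overhead = tbs_array_size / code_gen_buffer_size   # about 0.30

print(round(avg_code_per_tb), round(bytes_per_entry), round(overhead, 2))
```

So the array is sized as if every TB produced only about 400 bytes of host
code, and it costs roughly 30% on top of the code buffer whether or not those
entries are ever used.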

If we co-allocate the TB and the code, then we get exactly the right number of
TBs allocated with no further effort.

There will be some additional memory wastage, since we'll want to keep the code
and the data in different cache lines and that means padding, but I don't think
that'll be significant. Indeed, given the above over-allocation, it will
probably still be a net savings.
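A back-of-the-envelope Python sketch of such a layout, to get a feel for the
padding cost. The 64-byte cache line and the 120-byte structure size are
assumptions (the latter is just 153MB divided by the entry count), and the
helper names are hypothetical.

```python
CACHE_LINE = 64        # assumed host cache line size
TB_STRUCT_SIZE = 120   # assumed sizeof(TranslationBlock)


def align_up(x, align):
    """Round `x` up to a multiple of `align` (a power of two)."""
    return (x + align - 1) & -align


def place_tb(ptr, code_len):
    """Return (tb_off, code_off, next_ptr) for one co-allocated TB."""
    tb_off = align_up(ptr, CACHE_LINE)
    # Start the code on a fresh cache line after the TB structure.
    code_off = align_up(tb_off + TB_STRUCT_SIZE, CACHE_LINE)
    return tb_off, code_off, code_off + code_len


def padding_fraction(code_lens):
    """Fraction of the consumed buffer that is alignment padding."""
    ptr = used = 0
    for n in code_lens:
        _, _, ptr = place_tb(ptr, n)
        used += TB_STRUCT_SIZE + n
    return (ptr - used) / ptr


print(place_tb(0, 400))
print(f"{padding_fraction([400] * 1000):.3f}")
```

In this layout the code starts 128 bytes after its TB, far inside the 1MB
reach of ADR and LDR (literal).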

r~
