Re: [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride

qemu-riscv

From:	Max Chou
Subject:	Re: [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions
Date:	Sat, 17 Feb 2024 17:52:10 +0800
User-agent:	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0) Gecko/20100101 Thunderbird/102.14.0

Hi Richard,

Thank you for the suggestion and the reference.

I'm following the reference to implement it, and I'll send another version of this series.


Thanks a lot,
Max

On 2024/2/16 4:24 AM, Richard Henderson wrote:

On 2/15/24 09:28, Max Chou wrote:
Hi all,

When glibc is built with RVV support [1], the memcpy benchmark runs 2x to
60x slower than the scalar equivalent on QEMU, which hurts developer
productivity.

From the performance analysis result, we can observe that the glibc
memcpy spends most of the time in the vector unit-stride load/store
helper functions.

Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664
   Children      Self  Command       Shared Object Symbol
+   28.46%    27.85%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
+   26.92%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000000ff
+   14.41%    14.41%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
+   13.85%    13.85%  qemu-riscv64  qemu-riscv64             [.] lde_b
+   13.64%    13.64%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
+    9.25%     9.19%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
+    7.81%     7.81%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
+    7.70%     7.70%  qemu-riscv64  qemu-riscv64             [.] ste_b
+    5.53%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)

So this patchset tries to improve the performance of the RVV version of
glibc memcpy on QEMU by improving the corresponding helper function
quality.

The overall performance improvement reaches the following numbers
(depending on the size).
Average: 2.86X / Smallest: 1.15X / Largest: 4.49X

PS: This RFC patchset only focuses on the vle8.v & vse8.v instructions;
the next version or next series will cover the remaining vector ld/st
instructions.

You are still not tackling the root problem, which is over-use of the
full out-of-line load/store routines. The reason that cpu_mmu_lookup
is in that list is because you are performing the full virtual address
resolution for each and every byte.

The only way to make a real improvement is to perform virtual address
resolution *once* for the entire vector. I refer to my previous advice:
https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369
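
As a rough illustration of the difference between per-byte and per-vector
resolution, here is a toy sketch in plain C. It is not QEMU code: guest_ram,
toy_lookup, load_per_byte and load_per_page are hypothetical names invented
for this example, and toy_lookup only stands in for a full address
resolution by counting how often it is called. A vector that crosses a page
boundary needs one lookup per page touched, so "once" becomes "at most
twice" for a span shorter than a page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u   /* assumed toy page size */
#define NUM_PAGES 4u      /* assumed toy guest memory size, in pages */

/* Toy guest memory and a counter of "address resolutions" performed. */
uint8_t guest_ram[NUM_PAGES * PAGE_SIZE];
unsigned long lookup_count;

/* Hypothetical stand-in for a full virtual address resolution. */
uint8_t *toy_lookup(uint64_t vaddr)
{
    assert(vaddr < sizeof(guest_ram));
    lookup_count++;
    return &guest_ram[vaddr];
}

/* Per-byte approach: every element pays for its own lookup. */
void load_per_byte(uint8_t *vreg, uint64_t vaddr, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        vreg[i] = *toy_lookup(vaddr + i);
    }
}

/* Per-page approach: resolve once per page touched, then copy in bulk. */
void load_per_page(uint8_t *vreg, uint64_t vaddr, size_t len)
{
    while (len > 0) {
        /* Bytes left before the next page boundary. */
        size_t in_page = PAGE_SIZE - (size_t)(vaddr & (PAGE_SIZE - 1));
        size_t chunk = len < in_page ? len : in_page;

        memcpy(vreg, toy_lookup(vaddr), chunk);
        vreg += chunk;
        vaddr += chunk;
        len -= chunk;
    }
}
```

Loading 64 bytes that straddle a page boundary costs 64 lookups with
load_per_byte and 2 with load_per_page, while both produce the same
register contents.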


r~
