[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Qemu-devel] [RFC 20/30] target-i386: remove helper_lock()
From: |
Emilio G. Cota |
Subject: |
[Qemu-devel] [RFC 20/30] target-i386: remove helper_lock() |
Date: |
Mon, 27 Jun 2016 15:02:06 -0400 |
It's been superseded by the atomic helpers.
The use of the atomic helpers provides a significant performance and scalability
improvement. Below is the result of running the atomic_add-test microbenchmark
with:
$ x86_64-linux-user/qemu-x86_64 tests/atomic_add-bench -o 5000000 -r $r -n $n
, where $n is the number of threads and $r is the allowed range for the
additions.
The scenarios measured are:
- atomic: implements x86' ADDL with the atomic_add helper (i.e. this patchset)
- cmpxchg: implement x86' ADDL with a TCG loop using the cmpxchg helper
- master: before this patchset
Results sorted in ascending range, i.e. descending degree of contention.
Y axis is Throughput in Mops/s. Tests are run on an AMD machine with 64
Opteron 6376 cores.
atomic_add-bench: 5000000 ops/thread, [0,1] range
25 ++---------+----------+---------+----------+----------+----------+---++
+ atomic +-E--+ + + + + + |
|cmpxchg +-H--+ |
20 +Emaster +-N--+ ++
|| |
|++ |
|| |
15 +++ ++
|N| |
|+| |
10 ++| ++
|+|+ |
| | -+E+------ +++ ---+E+------+E+------+E+-----+E+------+E|
|+E+E+- +++ +E+------+E+-- |
5 ++|+ ++
|+N+H+--- +++ |
++++N+--+H++----+++ + +++ --++H+------+H+------+H++----+H+---+--- |
0 ++---------+-----H----+---H-----+----------+----------+----------+---H+
0 10 20 30 40 50 60
Number of threads
atomic_add-bench: 5000000 ops/thread, [0,2] range
25 ++---------+----------+---------+----------+----------+----------+---++
++atomic +-E--+ + + + + + |
|cmpxchg +-H--+ |
20 ++master +-N--+ ++
|E| |
|++ |
||E |
15 ++| ++
|N|| |
|+|| ---+E+------+E+-----+E+------+E|
10 ++| | ---+E+------+E+-----+E+--- +++ +++
||H+E+--+E+-- |
|+++++ |
| || |
5 ++|+H+-- +++ ++
|+N+ - ---+H+------+H+------ |
+ +N+--+H++----+H+---+--+H+----++H+--- + + +H+---+--+H|
0 ++---------+----------+---------+----------+----------+----------+---++
0 10 20 30 40 50 60
Number of threads
atomic_add-bench: 5000000 ops/thread, [0,8] range
40 ++---------+----------+---------+----------+----------+----------+---++
++atomic +-E--+ + + + + + |
35 +cmpxchg +-H--+ ++
| master +-N--+ ---+E+------+E+------+E+-----+E+------+E|
30 ++| ---+E+-- +++ ++
| | -+E+--- |
25 ++E ---- +++ ++
|+++++ -+E+ |
20 +E+ E-- +++ ++
|H|+++ |
|+| +H+------- |
15 ++H+ ---+++ +H+------ ++
|N++H+-- +++--- +H+------++|
10 ++ +++ - +++ ---+H+ +++ +H+
| | +H+-----+H+------+H+-- |
5 ++| +++ ++
++N+N+--+N++ + + + + + |
0 ++---------+----------+---------+----------+----------+----------+---++
0 10 20 30 40 50 60
Number of threads
atomic_add-bench: 5000000 ops/thread, [0,128] range
160 ++---------+---------+----------+---------+----------+----------+---++
+ atomic +-E--+ + + + + + |
140 +cmpxchg +-H--+ +++ +++ ++
| master +-N--+ E--------E------+E+------++|
120 ++ --| | +++ E+
| -- +++ +++ ++|
100 ++ - ++
| +++- +++ ++|
80 ++ -+E+ -+H+------+H+------H--------++
| ---- ---- +++ H|
| ---+E+-----+E+- ---+H+ ++|
60 ++ +E+--- +++ ---+H+--- ++
| --+++ ---+H+-- |
40 ++ +E+-+H+--- ++
| +H+ |
20 +EE+ ++
+N+ + + + + + + |
0 ++N-N---N--+---------+----------+---------+----------+----------+---++
0 10 20 30 40 50 60
Number of threads
atomic_add-bench: 5000000 ops/thread, [0,1024] range
350 ++---------+---------+----------+---------+----------+----------+---++
+ atomic +-E--+ + + + + + |
300 +cmpxchg +-H--+ +++
| master +-N--+ +++ ||
| +++ | ----E|
250 ++ | ----E---- ++
| ----E--- | ---+H|
200 ++ -+E+--- +++ ---+H+--- ++
| ---- -+H+-- |
| +E+ +++ ---- +++ |
150 ++ ---+++ ---+H+- ++
| --- -+H+-- |
100 ++ ---+E+ ---- +++ ++
| +++ ---+E+-----+H+- |
| -+E+------+H+-- |
50 ++ +E+ ++
+EE+ + + + + + + |
0 ++N-N---N--+---------+----------+---------+----------+----------+---++
0 10 20 30 40 50 60
Number of threads
hi-res: http://imgur.com/a/fMRmq
For master I stopped measuring master after 8 threads, because there is little
point in measuring the well-known performance collapse of a contended lock.
Note that using atomic helpers instead of cmpxchg is not only more performant
(when the host implements natively the same atomics, as is in this case); it
also
simplifies the necessary TCG/helper code of the implementation. Compare
the difference between the atomic and cmpxchg implementations:
case OP_ADDL:
if (s1->prefix & PREFIX_LOCK) {
- gen_atomic_add_fetch(cpu_T0, cpu_A0, cpu_T1, ot);
+ TCGv t0, t1, t2, a0;
+ TCGLabel *retry;
+
+ t0 = tcg_temp_local_new();
+ t1 = tcg_temp_local_new();
+ t2 = tcg_temp_local_new();
+ a0 = tcg_temp_local_new();
+ retry = gen_new_label();
+
+ tcg_gen_mov_tl(a0, cpu_A0);
+ tcg_gen_mov_tl(t1, cpu_T1);
+ gen_set_label(retry);
+ gen_op_ld_v(s1, ot, cpu_T0, a0);
+ tcg_gen_add_tl(t2, cpu_T0, t1);
+ gen_cmpxchg(t0, a0, cpu_T0, t2, ot);
+ tcg_gen_brcond_tl(TCG_COND_NE, t0, cpu_T0, retry);
+
+ tcg_gen_mov_tl(cpu_T0, t2);
+ tcg_gen_mov_tl(cpu_T1, t1);
+
+ tcg_temp_free(t0);
+ tcg_temp_free(t1);
+ tcg_temp_free(t2);
+ tcg_temp_free(a0);
} else {
Signed-off-by: Emilio G. Cota <address@hidden>
---
target-i386/helper.h | 2 --
target-i386/mem_helper.c | 33 ---------------------------------
target-i386/translate.c | 15 ---------------
3 files changed, 50 deletions(-)
diff --git a/target-i386/helper.h b/target-i386/helper.h
index df68204..17a750b 100644
--- a/target-i386/helper.h
+++ b/target-i386/helper.h
@@ -1,8 +1,6 @@
DEF_HELPER_FLAGS_4(cc_compute_all, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
DEF_HELPER_FLAGS_4(cc_compute_c, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
-DEF_HELPER_0(lock, void)
-DEF_HELPER_0(unlock, void)
DEF_HELPER_3(write_eflags, void, env, tl, i32)
DEF_HELPER_1(read_eflags, tl, env)
DEF_HELPER_2(divb_AL, void, env, tl)
diff --git a/target-i386/mem_helper.c b/target-i386/mem_helper.c
index 13f4f3b..949666a 100644
--- a/target-i386/mem_helper.c
+++ b/target-i386/mem_helper.c
@@ -23,39 +23,6 @@
#include "exec/exec-all.h"
#include "exec/cpu_ldst.h"
-/* broken thread support */
-
-#if defined(CONFIG_USER_ONLY)
-QemuMutex global_cpu_lock;
-
-void helper_lock(void)
-{
- qemu_mutex_lock(&global_cpu_lock);
-}
-
-void helper_unlock(void)
-{
- qemu_mutex_unlock(&global_cpu_lock);
-}
-
-void helper_lock_init(void)
-{
- qemu_mutex_init(&global_cpu_lock);
-}
-#else
-void helper_lock(void)
-{
-}
-
-void helper_unlock(void)
-{
-}
-
-void helper_lock_init(void)
-{
-}
-#endif
-
#define GEN_CMPXCHG_HELPER(NAME) \
target_ulong glue(helper_, NAME)(CPUX86State *env, target_ulong addr, \
target_ulong old, target_ulong new) \
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 48c6742..8365c83 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -4636,10 +4636,6 @@ static target_ulong disas_insn(CPUX86State *env,
DisasContext *s,
s->aflag = aflag;
s->dflag = dflag;
- /* lock generation */
- if (prefixes & PREFIX_LOCK)
- gen_helper_lock();
-
/* now check op code */
reswitch:
switch(b) {
@@ -8304,20 +8300,11 @@ static target_ulong disas_insn(CPUX86State *env,
DisasContext *s,
default:
goto unknown_op;
}
- /* lock generation */
- if (s->prefix & PREFIX_LOCK)
- gen_helper_unlock();
return s->pc;
illegal_op:
- if (s->prefix & PREFIX_LOCK)
- gen_helper_unlock();
- /* XXX: ensure that no lock was generated */
gen_illegal_opcode(s);
return s->pc;
unknown_op:
- if (s->prefix & PREFIX_LOCK)
- gen_helper_unlock();
- /* XXX: ensure that no lock was generated */
gen_unknown_opcode(env, s);
return s->pc;
}
@@ -8409,8 +8396,6 @@ void tcg_x86_init(void)
offsetof(CPUX86State, bnd_regs[i].ub),
bnd_regu_names[i]);
}
-
- helper_lock_init();
}
/* generate intermediate code for basic block 'tb'. */
--
2.5.0
- [Qemu-devel] [RFC 06/30] target-i386: emulate LOCK'ed cmpxchg8b/16b using cmpxchg helpers, (continued)
- [Qemu-devel] [RFC 06/30] target-i386: emulate LOCK'ed cmpxchg8b/16b using cmpxchg helpers, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 05/30] target-i386: emulate LOCK'ed cmpxchg using cmpxchg helpers, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 13/30] target-i386: emulate LOCK'ed INC using atomic helper, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 15/30] target-i386: emulate LOCK'ed NEG using cmpxchg helper, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 27/30] target-arm: emulate aarch64's LL/SC using cmpxchg helpers, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 25/30] helper: add DEF_HELPER_6, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 16/30] target-i386: emulate LOCK'ed XADD using atomic helper, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 23/30] target-arm: add atomic_xchg helper, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 22/30] target-arm: emulate LL/SC using cmpxchg helpers, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 29/30] linux-user: remove handling of aarch64's EXCP_STREX, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 20/30] target-i386: remove helper_lock(),
Emilio G. Cota <=
- [Qemu-devel] [RFC 26/30] target-arm: add cmpxchg helpers for aarch64, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 17/30] target-i386: emulate LOCK'ed BTX ops using atomic helpers, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 24/30] target-arm: emulate SWP with atomic_xchg helper, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 30/30] target-arm: remove EXCP_STREX + cpu_exclusive_{test, info}, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 18/30] target-i386: emulate XCHG using atomic helper, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 28/30] linux-user: remove handling of ARM's EXCP_STREX, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 21/30] target-arm: add cmpxchg helpers, Emilio G. Cota, 2016/06/27
- [Qemu-devel] [RFC 19/30] tests: add atomic_add-bench, Emilio G. Cota, 2016/06/27
- Re: [Qemu-devel] [RFC 00/30] cmpxchg-based emulation of atomics, LluĂs Vilanova, 2016/06/28