Hi everybody,
Hi Peter,
I tried to recap what we did, how it "works", and what the status is:

All the mechanisms are basically unchanged. A lot of TCG structures are not
thread safe, and all TCG threads can run at the same time and sometimes want
to generate code at the same time.
Translation-block-related structure:
struct TBContext {
    TranslationBlock *tbs;
    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
    int nb_tbs;

    /* any access to the tbs or the page table must use this lock */
    QemuMutex tb_lock;

    /* statistics */
    int tb_flush_count;
    int tb_phys_invalidate_count;
    int tb_invalidated_flag;
};
This structure is used in TCGContext: TBContext tb_ctx;
"tbs" is basically where the translated blocks are stored, and tb_phys_hash
is a hash table to find them quickly.
There are two solutions to prevent threading issues:
A/ Just have two tb_ctx.
B/ Share it between CPUs and protect access to tb_ctx.
We took the second solution so that all CPUs can benefit from the translated TBs.
TBContext is written to almost everywhere in translate-all.c. When there are
too many TBs, a tb_flush occurs and destroys the array. We don't handle this
case right now.
tb_lock is already used by user-mode code, so we just converted it to a
QemuMutex so we can reuse it in system-mode.
struct TCGContext {
    uint8_t *pool_cur, *pool_end;
    TCGPool *pool_first, *pool_current, *pool_first_large;
    TCGLabel *labels;
    int nb_labels;
    int nb_globals;
    int nb_temps;

    /* goto_tb support */
    tcg_insn_unit *code_buf;
    uintptr_t *tb_next;
    uint16_t *tb_next_offset;
    uint16_t *tb_jmp_offset;   /* != NULL if USE_DIRECT_JUMP */

    /* liveness analysis */
    uint16_t *op_dead_args;    /* for each operation, each bit tells if the
                                  corresponding argument is dead */
    uint8_t *op_sync_args;     /* for each operation, each bit tells if the
                                  corresponding output argument needs to be
                                  synced to memory. */

    /* tells in which temporary a given register is. It does not take
       into account fixed registers */
    int reg_to_temp[TCG_TARGET_NB_REGS];
    TCGRegSet reserved_regs;
    intptr_t current_frame_offset;
    intptr_t frame_start;
    intptr_t frame_end;
    int frame_reg;

    tcg_insn_unit *code_ptr;
    TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
    TCGTempSet free_temps[TCG_TYPE_COUNT * 2];

    GHashTable *helpers;

#ifdef CONFIG_PROFILER
    /* profiling info */
    int64_t tb_count1;
    int64_t tb_count;
    int64_t op_count;          /* total insn count */
    int op_count_max;          /* max insn per TB */
    int64_t temp_count;
    int temp_count_max;
    int64_t del_op_count;
    int64_t code_in_len;
    int64_t code_out_len;
    int64_t interm_time;
    int64_t code_time;
    int64_t la_time;
    int64_t opt_time;
    int64_t restore_count;
    int64_t restore_time;
#endif

#ifdef CONFIG_DEBUG_TCG
    int temps_in_use;
    int goto_tb_issue_mask;
#endif

    uint16_t gen_opc_buf[OPC_BUF_SIZE];
    TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];
    uint16_t *gen_opc_ptr;
    TCGArg *gen_opparam_ptr;
    target_ulong gen_opc_pc[OPC_BUF_SIZE];
    uint16_t gen_opc_icount[OPC_BUF_SIZE];
    uint8_t gen_opc_instr_start[OPC_BUF_SIZE];

    /* Code generation.  Note that we specifically do not use tcg_insn_unit
       here, because there's too much arithmetic throughout that relies
       on addition and subtraction working on bytes.  Rely on the GCC
       extension that allows arithmetic on void*.  */
    int code_gen_max_blocks;
    void *code_gen_prologue;
    void *code_gen_buffer;
    size_t code_gen_buffer_size;
    /* threshold to flush the translated code buffer */
    size_t code_gen_buffer_max_size;
    void *code_gen_ptr;

    TBContext tb_ctx;

    /* The TCGBackendData structure is private to tcg-target.c. */
    struct TCGBackendData *be;
};
This structure is used to translate the TBs.
The easiest solution was to protect code generation so that only one CPU can
generate code at a time. This is reasonable anyway, as we don't want duplicate
generated TBs in the pool. This is achieved with the tb_lock mentioned above.
TLB:
The TLB seems to be CPU-dependent, so it is not really a problem since in our
implementation one CPU = one pthread. But sometimes a CPU wants to flush the
TLB, through an instruction for example. It is very likely that another CPU
in another thread is executing code at the same time. That's why we chose to
create a tlb_flush mechanism: when a CPU wants to flush, it asks all CPUs to
exit TCG, waits for them, and then exits itself. This can be reused for
tb_invalidate and/or tb_flush as well.
Atomic instructions:
Atomic instructions are quite hard to implement. The TranslationBlock
implementing an atomic instruction can't be interrupted during execution
(e.g. by an interrupt or a signal); a cmpxchg64 helper is used for that.
QEMU's global lock:
TCG threads take the lock during code execution. This is not OK for
multi-threading because it means only one thread will be running at a time.
That's why we took Jan's patch to allow TCG to run without the lock and take
it only when needed.
What is the status:
* We can start a vexpress-a15 simulation with two A15s and run two dhrystones
  at a time; performance is improved and it's quite stable.
What is missing:
* tb_flush is not implemented correctly.
* The PageDesc structure is not protected; the patch which introduced a
  first_tb array was not the right approach and has been removed. This
  implies that tb_invalidate is broken.
For both issues we plan to use the same mechanism as tlb_flush: exiting all
the CPUs, flushing/invalidating, and letting them continue. A generic
mechanism must be implemented for that.
Known issues:
* The GDB stub is broken because it uses tb_invalidate and we didn't
  implement that for now; there are probably other issues.
* SMP > 2 crashes, probably because of tb_invalidate as well.
* We don't know the status of the user-mode code, which is probably broken
  by our changes.