Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.


From: Frederic Konrad
Subject: Re: [Qemu-devel] [RFC 00/10] MultiThread TCG.
Date: Wed, 22 Apr 2015 14:26:14 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0

On 10/04/2015 18:03, Frederic Konrad wrote:
On 30/03/2015 23:46, Peter Maydell wrote:
On 30 March 2015 at 07:52, Mark Burton <address@hidden> wrote:
So - Fred is unwilling to send the patch set as it stands, because frankly this part is totally broken.

There is an independent patch set that needs splitting out which deals with just the atomic instruction issue - specifically for ARM (though I guess it’s applicable across the board)…

So - in short - I HOPE to get the patch set onto the reflector sometime next week, and I’m sorry for the delay.
What I really want to see is not so much the patch set
but the design sketch I asked for that lists the
various data structures and indicates which ones
are going to be per-cpu, which ones will be shared
(and with what locking), etc.

-- PMM

Does that make sense?

BTW here is the repository:
git clone address@hidden:fkonrad/mttcg.git -b multi_tcg_v4

Thanks,
Fred

Hi everybody,
Hi Peter,

I tried to recap what we did, how it "works", and what the status is:

All the mechanisms are basically unchanged.

A lot of TCG structures are not thread safe, and all the TCG threads can run at
the same time and sometimes want to generate code at the same time.

Translation block related structure:

struct TBContext {

    TranslationBlock *tbs;
    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
    int nb_tbs;
    /* any access to the tbs or the page table must use this lock */
    QemuMutex tb_lock;

    /* statistics */
    int tb_flush_count;
    int tb_phys_invalidate_count;

    int tb_invalidated_flag;
};

This structure is used in TCGContext: TBContext tb_ctx;

"tbs" is basically where the translated block are stored and tb_phys_hash an
hash table to find them quickly.

There are two possible solutions to prevent threading issues:
  A/ Have one tb_ctx per CPU.
  B/ Share a single tb_ctx between CPUs and protect access to it.

We took the second solution so all CPUs can benefit from the translated TBs.
TBContext is written almost everywhere in translate-all.c.
When there are too many TBs, a tb_flush occurs and destroys the array. We don't
handle this case right now.
tb_lock is already used by user-mode code, so we just converted it to a
QemuMutex so we can reuse it in system-mode.
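
As a minimal sketch of what option B means in practice (assuming the tb_lock
above has already been converted to a QemuMutex), any lookup in the shared hash
table is done with the lock held; the matching logic below is simplified
compared to the real tb_find_slow():

/* Sketch only: walk one tb_phys_hash bucket with tb_lock held.  The real
   lookup also matches pc, cs_base and flags, not just the physical page. */
static TranslationBlock *tb_hash_lookup_locked(tb_page_addr_t phys_pc,
                                               unsigned int h)
{
    TranslationBlock *tb;

    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
    for (tb = tcg_ctx.tb_ctx.tb_phys_hash[h]; tb != NULL;
         tb = tb->phys_hash_next) {
        if (tb->page_addr[0] == phys_pc) {
            break;  /* found an already translated block */
        }
    }
    qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
    return tb;
}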

struct TCGContext {
    uint8_t *pool_cur, *pool_end;
    TCGPool *pool_first, *pool_current, *pool_first_large;
    TCGLabel *labels;
    int nb_labels;
    int nb_globals;
    int nb_temps;

    /* goto_tb support */
    tcg_insn_unit *code_buf;
    uintptr_t *tb_next;
    uint16_t *tb_next_offset;
    uint16_t *tb_jmp_offset; /* != NULL if USE_DIRECT_JUMP */

    /* liveness analysis */
    uint16_t *op_dead_args; /* for each operation, each bit tells if the
                               corresponding argument is dead */
    uint8_t *op_sync_args;  /* for each operation, each bit tells if the
                               corresponding output argument needs to be
                               sync to memory. */

    /* tells in which temporary a given register is. It does not take
       into account fixed registers */
    int reg_to_temp[TCG_TARGET_NB_REGS];
    TCGRegSet reserved_regs;
    intptr_t current_frame_offset;
    intptr_t frame_start;
    intptr_t frame_end;
    int frame_reg;

    tcg_insn_unit *code_ptr;
    TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
    TCGTempSet free_temps[TCG_TYPE_COUNT * 2];

    GHashTable *helpers;

#ifdef CONFIG_PROFILER
    /* profiling info */
    int64_t tb_count1;
    int64_t tb_count;
    int64_t op_count; /* total insn count */
    int op_count_max; /* max insn per TB */
    int64_t temp_count;
    int temp_count_max;
    int64_t del_op_count;
    int64_t code_in_len;
    int64_t code_out_len;
    int64_t interm_time;
    int64_t code_time;
    int64_t la_time;
    int64_t opt_time;
    int64_t restore_count;
    int64_t restore_time;
#endif

#ifdef CONFIG_DEBUG_TCG
    int temps_in_use;
    int goto_tb_issue_mask;
#endif

    uint16_t gen_opc_buf[OPC_BUF_SIZE];
    TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];

    uint16_t *gen_opc_ptr;
    TCGArg *gen_opparam_ptr;
    target_ulong gen_opc_pc[OPC_BUF_SIZE];
    uint16_t gen_opc_icount[OPC_BUF_SIZE];
    uint8_t gen_opc_instr_start[OPC_BUF_SIZE];

    /* Code generation. Note that we specifically do not use tcg_insn_unit
       here, because there's too much arithmetic throughout that relies
       on addition and subtraction working on bytes.  Rely on the GCC
       extension that allows arithmetic on void*.  */
    int code_gen_max_blocks;
    void *code_gen_prologue;
    void *code_gen_buffer;
    size_t code_gen_buffer_size;
    /* threshold to flush the translated code buffer */
    size_t code_gen_buffer_max_size;
    void *code_gen_ptr;

    TBContext tb_ctx;

    /* The TCGBackendData structure is private to tcg-target.c. */
    struct TCGBackendData *be;
};

This structure is used to translate the TBs.
The simplest solution was to protect code generation so that only one CPU can
generate code at a time. This is fine since we don't want duplicated TBs in the
pool anyway. This is achieved with the tb_lock used above.
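
A rough sketch of that rule, using the real tb_gen_code() entry point but
leaving out the lookup/retry logic that cpu_exec() performs around it:

/* Sketch only: serialise code generation on tb_lock so a single CPU
   translates at a time. */
static TranslationBlock *tb_generate_locked(CPUState *cpu, target_ulong pc,
                                            target_ulong cs_base, int flags)
{
    TranslationBlock *tb;

    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
    /* Another thread may have translated this pc while we waited for the
       lock; a real implementation re-checks the hash table here so the
       same block is not generated twice. */
    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
    qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
    return tb;
}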

TLB:

The TLB is per-CPU, so it is not really a problem since in our implementation
one CPU = one pthread. But sometimes a CPU wants to flush the TLB, through an
instruction for example, and it is very likely that another CPU in another
thread is executing code at the same time. That's why we chose to create a
tlb_flush mechanism:
when a CPU wants to flush, it asks all CPUs to exit TCG, waits for them, and
then exits itself. This can be reused for tb_invalidate and/or tb_flush as
well.
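
An illustrative sketch of that mechanism; cpu_exit(), CPU_FOREACH() and
tlb_flush() are existing QEMU APIs, while the wait/resume helpers are only
placeholders for the synchronisation the patch set has to provide:

/* Sketch only: make every vCPU thread leave TCG before touching the TLBs. */
static void tlb_flush_all_cpus(void)
{
    CPUState *cpu;

    /* Ask every vCPU thread to leave the TCG execution loop. */
    CPU_FOREACH(cpu) {
        cpu_exit(cpu);
    }

    /* Placeholder: block until all vCPU threads have stopped executing TBs. */
    wait_for_all_cpus_halted();

    /* Now it is safe to flush every CPU's TLB. */
    CPU_FOREACH(cpu) {
        tlb_flush(cpu, 1);
    }

    /* Placeholder: let the vCPU threads resume. */
    resume_all_cpus();
}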

Atomic instructions:

Atomic instructions are quite hard to implement.
The TranslationBlock implementing an atomic instruction must not be interrupted
during its execution (e.g. by an interrupt or a signal); a cmpxchg64 helper is
used for that.
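
For illustration only, a 64-bit compare-and-swap helper in the spirit of the
cmpxchg64 helper mentioned above could be built on the GCC __atomic builtins
(the actual helper in the patch set may differ):

#include <stdbool.h>
#include <stdint.h>

/* Sketch only: atomically replace *addr with newval if it still holds
   expected, returning true on success. */
static bool helper_cmpxchg64(uint64_t *addr, uint64_t expected, uint64_t newval)
{
    return __atomic_compare_exchange_n(addr, &expected, newval, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}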

QEMU's global lock:

TCG threads take the global lock during code execution. This is not OK for
multi-threading because it means only one thread can run at a time. That's why
we took Jan's patch to allow TCG to run without the lock and take it only when
needed.
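
The idea looks roughly like this: guest code runs without the global lock, and
a device access re-takes it only around the operation that needs it.
qemu_mutex_lock_iothread()/qemu_mutex_unlock_iothread() are the existing BQL
helpers; the MMIO handler and do_device_read() below are only illustrative:

/* Sketch only: take the global lock just for the device access instead of
   holding it for the whole execution loop. */
static uint64_t mmio_read_with_bql(void *opaque, hwaddr addr, unsigned size)
{
    uint64_t val;

    qemu_mutex_lock_iothread();
    val = do_device_read(opaque, addr, size);   /* illustrative device read */
    qemu_mutex_unlock_iothread();
    return val;
}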

What is the status:

 * We can start a vexpress-a15 simulation with two A15 cores and run two
   Dhrystone benchmarks at a time; performance is improved and it's quite
   stable.

What is missing:

 * tb_flush is not implemented correctly.
 * The PageDesc structure is not protected; the patch which introduced a
   first_tb array was not the right approach and has been removed. This implies
   that tb_invalidate is broken.

For both issues we plan to use the same mechanism as for tlb_flush: exit all
the CPUs, flush or invalidate, and let them continue. A generic mechanism must
be implemented for that.

Known issues:

 * The GDB stub is broken because it uses tb_invalidate, which we haven't
   implemented yet, and there are probably other issues.
 * SMP > 2 crashes, probably because of tb_invalidate as well.
 * We don't know the status of user-mode code, which is probably broken by our
   changes.




