[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC] Don't lookup full CPU state in the indirect branch fast path o

From: Richard Henderson
Subject: Re: [RFC] Don't lookup full CPU state in the indirect branch fast path on AArch64 when running in user mode.
Date: Mon, 19 Oct 2020 11:22:52 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

On 9/29/20 2:32 PM, Owen Anderson wrote:
> Hello,
> I would like to request feedback on the following patch, which I do
> not believe should be applied to master as-is.  The idea here is to
> avoid gathering the full CPU state in the fast path of an indirect
> branch lookup when running in user mode on a platform where the flags
> can only be changed in privileged mode.  I believe this is true on the
> AArch64 scenario that I care about, but clearly not true in general.
> I'm particularly seeking feedback on how to clean this up into a
> version that checks the correct necessary and sufficient conditions to
> allow all users that can benefit from it to do so.
> On the workload that I am targeting (aarch64 on x86), this patch
> reduces execution wall time by approximately 20%, and eliminates
> indirect branch lookups from the hot stack traces entirely.

(1) What qemu version are you looking at and,
(2) Do you have --enable-tcg-debug enabled?

Because you should not be seeing anything even close to 20% overhead.

In e979972a6a1 (included in qemu 4.2), the AArch64 path is

    uint32_t flags = env->hflags;
    *cs_base = 0;
    if (FIELD_EX32(flags, TBFLAG_ANY, AARCH64_STATE)) {
        *pc = env->pc;
        if (cpu_isar_feature(aa64_bti, env_archcpu(env))) {
            flags = FIELD_DP32(flags, TBFLAG_A64,
               BTYPE, env->btype);
        pstate_for_ss = env->pstate;
    if (FIELD_EX32(flags, TBFLAG_ANY, SS_ACTIVE) &&
        (pstate_for_ss & PSTATE_SS)) {
        flags = FIELD_DP32(flags, TBFLAG_ANY, PSTATE_SS, 1);
    *pflags = flags;

With --enable-tcg-debug, there is an additional step wherein we validate that
env->hflags has the correct value.  Which has caught a number of bugs.

With a silly testcase like so:

    for (int x = 0; x < 10000000; ++x) {
        void *tmp;
        asm volatile("adr %0,1f; br %0; 1:" : "=r"(tmp));

I see cpu_get_tb_cpu_state no higher than 10% of the total runtime.  Which, I
admit is higher than I expected, but still nothing like what you're reporting.
 And a "reasonable" test case should surely have a lower proportion of indirect
branches per dynamic instruction.

> +#if !defined(TARGET_ARM) || !defined(CONFIG_USER_ONLY)
>      cpu_get_tb_cpu_state(env, pc, cs_base, flags);
> +#else
> +    if (is_a64(env)) {
> +      *pc = env->pc;
> +    } else {
> +      *pc = env->regs[15];
> +    }
> +#endif
> +#if !defined(TARGET_ARM) || !defined(CONFIG_USER_ONLY)
>                 tb->cs_base == *cs_base &&
>                 tb->flags == *flags &&
> +#endif

This is assuming that all TB have the same flags, and thus the flags don't need
comparing.  Which is false, even for CONFIG_USER_ONLY.

I would guess that testing -cpu cortex-a57 does not use any of the bits that
might change, but post v8.2 will: SVE, BTI, MTE.  So, this change breaks stuff.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]