qemu-arm
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 19/45] target/arm: Implement SME MOVA


From: Peter Maydell
Subject: Re: [PATCH v4 19/45] target/arm: Implement SME MOVA
Date: Fri, 1 Jul 2022 17:19:45 +0100

On Tue, 28 Jun 2022 at 05:40, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> We can reuse the SVE functions for implementing moves to/from
> horizontal tile slices, but we need new ones for moves to/from
> vertical tile slices.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

> diff --git a/target/arm/sme_helper.c b/target/arm/sme_helper.c
> index eef2df73e1..95159862de 100644
> --- a/target/arm/sme_helper.c
> +++ b/target/arm/sme_helper.c
> @@ -19,8 +19,10 @@
>
>  #include "qemu/osdep.h"
>  #include "cpu.h"
> -#include "internals.h"

Did you mean to delete this #include line ?

> +#include "tcg/tcg-gvec-desc.h"
>  #include "exec/helper-proto.h"
> +#include "qemu/int128.h"
> +#include "vec_internal.h"
>
>  /* ResetSVEState */
>  void arm_reset_sve_state(CPUARMState *env)
> @@ -84,3 +86,111 @@ void helper_sme_zero(CPUARMState *env, uint32_t imm, 
> uint32_t svl)
>          }
>      }
>  }
> +
> +/*
> + * Move Zreg vector to ZArray column.
> + */
> +#define DO_MOVA_C(NAME, TYPE, H)                                        \
> +void HELPER(NAME)(void *za, void *vn, void *vg, uint32_t desc)          \
> +{                                                                       \
> +    int i, oprsz = simd_oprsz(desc);                                    \
> +    for (i = 0; i < oprsz; ) {                                          \
> +        uint16_t pg = *(uint16_t *)(vg + H1_2(i >> 3));                 \
> +        do {                                                            \
> +            if (pg & 1) {                                               \
> +                *(TYPE *)za = *(TYPE *)(vn + H(i));                     \
> +            }                                                           \
> +            za += sizeof(ARMVectorReg) * sizeof(TYPE);                  \
> +            i += sizeof(TYPE);                                          \
> +            pg >>= sizeof(TYPE);                                        \
> +        } while (i & 15);                                               \
> +    }                                                                   \
> +}
> +
> +DO_MOVA_C(sme_mova_cz_b, uint8_t, H1)
> +DO_MOVA_C(sme_mova_cz_h, uint16_t, H2)
> +DO_MOVA_C(sme_mova_cz_s, uint32_t, H4)
> +
> +void HELPER(sme_mova_cz_d)(void *za, void *vn, void *vg, uint32_t desc)
> +{
> +    int i, oprsz = simd_oprsz(desc) / 8;
> +    uint8_t *pg = vg;
> +    uint64_t *n = vn;
> +    uint64_t *a = za;
> +
> +    for (i = 0; i < oprsz; i++) {
> +        if (pg[H1_2(i)] & 1) {

The documentation of the H macros says
"The H1_<N> macros are used when performing byte arithmetic and then
 casting the final pointer to a type of size N."
but we're not casting anything to a 2-byte type, so what's happening here?

> +            a[i * sizeof(ARMVectorReg)] = n[i];
> +        }
> +    }
> +}
> +
> +void HELPER(sme_mova_cz_q)(void *za, void *vn, void *vg, uint32_t desc)
> +{
> +    int i, oprsz = simd_oprsz(desc) / 16;
> +    uint16_t *pg = vg;
> +    Int128 *n = vn;
> +    Int128 *a = za;
> +
> +    for (i = 0; i < oprsz; i++) {
> +        if (pg[H2(i)] & 1) {
> +            a[i * sizeof(ARMVectorReg)] = n[i];

Is it really OK to do this with an Int128 store? That is
in host-order, but the two halves of a 128-bit quantity
in the ZA array are in architectural order. I suppose the
source also will have them in the architectural order, but
it does look odd, especially uncommented.

> +        }
> +    }
> +}
> +
> +#undef DO_MOVA_C

> diff --git a/target/arm/translate-sme.c b/target/arm/translate-sme.c
> index 971504559b..8e6881086b 100644
> --- a/target/arm/translate-sme.c
> +++ b/target/arm/translate-sme.c
> @@ -35,6 +35,77 @@
>  #include "decode-sme.c.inc"
>
>
> +static TCGv_ptr get_tile_rowcol(DisasContext *s, int esz, int rs,
> +                                int tile_index, bool vertical)
> +{
> +    int tile = tile_index >> (4 - esz);
> +    int index = esz == MO_128 ? 0 : extract32(tile_index, 0, 4 - esz);
> +    int pos, len, offset;
> +    TCGv_i32 t_index;
> +    TCGv_ptr addr;
> +
> +    /* Resolve tile.size[index] to an untyped ZA slice index. */
> +    t_index = tcg_temp_new_i32();
> +    tcg_gen_trunc_tl_i32(t_index, cpu_reg(s, rs));
> +    tcg_gen_addi_i32(t_index, t_index, index);

This code isn't doing what the comment says; it's just calculating
the (not-yet-taken-MOD-dim) slice index, which does depend on the type.

> +
> +    len = ctz32(streaming_vec_reg_size(s)) - esz;

What is this the length of ?

> +
> +    if (vertical) {

I ended up reviewing this function from bottom to top because to
me the horizontal case seemed simpler to understand first.
(As a result the review comments might read a bit out-of-order.)
Dunno if that justifies flipping the condition around, though.

> +        /*
> +         * Compute the offset of the index within the tile:

"byte offset" ?

> +         *     (index % (svl / size)) * size
> +         *   = (index % (svl >> esz)) << esz
> +         * Perform the power-of-two modulo via extraction of the low @len 
> bits.
> +         * Perform the multiply by shifting left by @pos bits.
> +         * These two operations are performed simultaneously via deposit.
> +         */
> +        pos = esz;
> +        tcg_gen_deposit_z_i32(t_index, t_index, pos, len);

As an aside, this appears to be "deposit into a zero value", but
I had to go digging in the source to find that out, because we
don't document the semantics of this either in the header file
where this function is declared, or in a doc comment on the
implementation of it, or in tcg/README, or in docs/devel.
We really desperately need to document the interface the TCG
core code provides to frontends, because it's getting further
and further away from "it's functions to emit the IR ops
described in tcg/README, with obvious looking function names"...

> +
> +        /* The tile slice offset within env->zarray is the column offset. */
> +        offset = tile;

I don't understand why we can just add the tile index
(which is going to be an integer 0, 1, 2..) to a byte offset.
In the horizontal case we add tile * sizeof(ARMVectorReg),
which makes more sense to me.

> +
> +        /* Include the offset of zarray to make this relative to env. */
> +        offset += offsetof(CPUARMState, zarray);
> +        tcg_gen_addi_i32(t_index, t_index, offset);
> +
> +        /*
> +         * For big-endian, adjust the column slice offset within
> +         * the uint64_t host words that make up env->zarray.
> +         * This must wait until index and offset are combined.

Why? Neither the byte-offset of the start of the tile nor
the byte offset of zarray in CPUARMState ought to be non-8-aligned.

> +         */
> +        if (HOST_BIG_ENDIAN && esz < MO_64) {
> +            tcg_gen_xori_i32(t_index, t_index, 8 - (1 << esz));
> +        }
> +    } else {
> +        /*
> +         * Compute the offset of the index within the tile:

"byte offset", right ?

Also this is doing the "take index MOD dim" that the pseudocode
does as a preceding step at the same time, so we should mention that.

> +         *     (index % (svl / size)) * (size * sizeof(row))
> +         *   = (index % (svl >> esz)) << (esz + log2(sizeof(row)))
> +         */
> +        pos = esz + ctz32(sizeof(ARMVectorReg));
> +        tcg_gen_deposit_z_i32(t_index, t_index, pos, len);
> +
> +        /* The tile slice offset within env->zarray is the row offset. */
> +        offset = tile * sizeof(ARMVectorReg);
> +
> +        /* Include the offset of zarray to make this relative to env. */
> +        offset += offsetof(CPUARMState, zarray);
> +        tcg_gen_addi_i32(t_index, t_index, offset);
> +
> +        /* Row slices need no endian adjustment. */
> +    }
> +
> +    /* Add the offset to env to produce the final pointer. */
> +    addr = tcg_temp_new_ptr();
> +    tcg_gen_ext_i32_ptr(addr, t_index);
> +    tcg_temp_free_i32(t_index);
> +    tcg_gen_add_ptr(addr, addr, cpu_env);
> +
> +    return addr;
> +}

-- PMM



reply via email to

[Prev in Thread] Current Thread [Next in Thread]