[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Stable-8.2.7 27/53] target/arm: Handle denormals correctly for FMOPA (w
From: |
Michael Tokarev |
Subject: |
[Stable-8.2.7 27/53] target/arm: Handle denormals correctly for FMOPA (widening) |
Date: |
Fri, 6 Sep 2024 09:53:57 +0300 |
From: Peter Maydell <peter.maydell@linaro.org>
The FMOPA (widening) SME instruction takes pairs of half-precision
floating point values, widens them to single-precision, does a
two-way dot product and accumulates the results into a
single-precision destination. We don't quite correctly handle the
FPCR bits FZ and FZ16 which control flushing of denormal inputs and
outputs. This is because at the moment we pass a single float_status
value to the helper function, which then uses that configuration for
all the fp operations it does. However, because the inputs to this
operation are float16 and the outputs are float32 we need to use the
fp_status_f16 for the float16 input widening but the normal fp_status
for everything else. Otherwise we will apply the flushing control
FPCR.FZ16 to the 32-bit output rather than the FPCR.FZ control, and
incorrectly flush a denormal output to zero when we should not (or
vice-versa).
(In commit 207d30b5fdb5b we tried to fix the FZ handling but
didn't get it right, switching from "use FPCR.FZ for everything" to
"use FPCR.FZ16 for everything".)
(Mjt: it is commit 4975f9fc4ea3 in stable-8.2)
Pass the CPU env to the sme_fmopa_h helper instead of an fp_status
pointer, and have the helper pass an extra fp_status into the
f16_dotadd() function so that we can use the right status for the
right parts of this operation.
Cc: qemu-stable@nongnu.org
Fixes: 207d30b5fdb5 ("target/arm: Use FPST_F16 for SME FMOPA (widening)")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2373
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
(cherry picked from commit 55f9f4ee018c5ccea81d8c8c586756d7711ae46f)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 27eef49a11..d22bf9d21b 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -121,7 +121,7 @@ DEF_HELPER_FLAGS_5(sme_addha_d, TCG_CALL_NO_RWG, void, ptr,
ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(sme_addva_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_7(sme_fmopa_h, TCG_CALL_NO_RWG,
- void, ptr, ptr, ptr, ptr, ptr, ptr, i32)
+ void, ptr, ptr, ptr, ptr, ptr, env, i32)
DEF_HELPER_FLAGS_7(sme_fmopa_s, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_7(sme_fmopa_d, TCG_CALL_NO_RWG,
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index f9001f5213..3906bb51c0 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -976,12 +976,23 @@ static inline uint32_t f16mop_adj_pair(uint32_t pair,
uint32_t pg, uint32_t neg)
}
static float32 f16_dotadd(float32 sum, uint32_t e1, uint32_t e2,
- float_status *s_std, float_status *s_odd)
+ float_status *s_f16, float_status *s_std,
+ float_status *s_odd)
{
- float64 e1r = float16_to_float64(e1 & 0xffff, true, s_std);
- float64 e1c = float16_to_float64(e1 >> 16, true, s_std);
- float64 e2r = float16_to_float64(e2 & 0xffff, true, s_std);
- float64 e2c = float16_to_float64(e2 >> 16, true, s_std);
+ /*
+ * We need three different float_status for different parts of this
+ * operation:
+ * - the input conversion of the float16 values must use the
+ * f16-specific float_status, so that the FPCR.FZ16 control is applied
+ * - operations on float32 including the final accumulation must use
+ * the normal float_status, so that FPCR.FZ is applied
+ * - we have pre-set-up copy of s_std which is set to round-to-odd,
+ * for the multiply (see below)
+ */
+ float64 e1r = float16_to_float64(e1 & 0xffff, true, s_f16);
+ float64 e1c = float16_to_float64(e1 >> 16, true, s_f16);
+ float64 e2r = float16_to_float64(e2 & 0xffff, true, s_f16);
+ float64 e2c = float16_to_float64(e2 >> 16, true, s_f16);
float64 t64;
float32 t32;
@@ -1003,20 +1014,23 @@ static float32 f16_dotadd(float32 sum, uint32_t e1,
uint32_t e2,
}
void HELPER(sme_fmopa_h)(void *vza, void *vzn, void *vzm, void *vpn,
- void *vpm, void *vst, uint32_t desc)
+ void *vpm, CPUARMState *env, uint32_t desc)
{
intptr_t row, col, oprsz = simd_maxsz(desc);
uint32_t neg = simd_data(desc) * 0x80008000u;
uint16_t *pn = vpn, *pm = vpm;
- float_status fpst_odd, fpst_std;
+ float_status fpst_odd, fpst_std, fpst_f16;
/*
- * Make a copy of float_status because this operation does not
- * update the cumulative fp exception status. It also produces
- * default nans. Make a second copy with round-to-odd -- see above.
+ * Make copies of fp_status and fp_status_f16, because this operation
+ * does not update the cumulative fp exception status. It also
+ * produces default NaNs. We also need a second copy of fp_status with
+ * round-to-odd -- see above.
*/
- fpst_std = *(float_status *)vst;
+ fpst_f16 = env->vfp.fp_status_f16;
+ fpst_std = env->vfp.fp_status;
set_default_nan_mode(true, &fpst_std);
+ set_default_nan_mode(true, &fpst_f16);
fpst_odd = fpst_std;
set_float_rounding_mode(float_round_to_odd, &fpst_odd);
@@ -1036,7 +1050,8 @@ void HELPER(sme_fmopa_h)(void *vza, void *vzn, void *vzm,
void *vpn,
uint32_t m = *(uint32_t *)(vzm + H1_4(col));
m = f16mop_adj_pair(m, pcol, 0);
- *a = f16_dotadd(*a, n, m, &fpst_std, &fpst_odd);
+ *a = f16_dotadd(*a, n, m,
+ &fpst_f16, &fpst_std, &fpst_odd);
}
col += 4;
pcol >>= 4;
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index a50a419af2..ae42ddef7b 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -334,8 +334,29 @@ static bool do_outprod_fpst(DisasContext *s, arg_op *a,
MemOp esz,
return true;
}
-TRANS_FEAT(FMOPA_h, aa64_sme, do_outprod_fpst, a,
- MO_32, FPST_FPCR_F16, gen_helper_sme_fmopa_h)
+static bool do_outprod_env(DisasContext *s, arg_op *a, MemOp esz,
+ gen_helper_gvec_5_ptr *fn)
+{
+ int svl = streaming_vec_reg_size(s);
+ uint32_t desc = simd_desc(svl, svl, a->sub);
+ TCGv_ptr za, zn, zm, pn, pm;
+
+ if (!sme_smza_enabled_check(s)) {
+ return true;
+ }
+
+ za = get_tile(s, esz, a->zad);
+ zn = vec_full_reg_ptr(s, a->zn);
+ zm = vec_full_reg_ptr(s, a->zm);
+ pn = pred_full_reg_ptr(s, a->pn);
+ pm = pred_full_reg_ptr(s, a->pm);
+
+ fn(za, zn, zm, pn, pm, tcg_env, tcg_constant_i32(desc));
+ return true;
+}
+
+TRANS_FEAT(FMOPA_h, aa64_sme, do_outprod_env, a,
+ MO_32, gen_helper_sme_fmopa_h)
TRANS_FEAT(FMOPA_s, aa64_sme, do_outprod_fpst, a,
MO_32, FPST_FPCR, gen_helper_sme_fmopa_s)
TRANS_FEAT(FMOPA_d, aa64_sme_f64f64, do_outprod_fpst, a,
--
2.39.2
- [Stable-8.2.7 20/53] target/arm: Don't assert for 128-bit tile accesses when SVL is 128, (continued)
- [Stable-8.2.7 20/53] target/arm: Don't assert for 128-bit tile accesses when SVL is 128, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 21/53] target/arm: Fix UMOPA/UMOPS of 16-bit values, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 22/53] target/arm: Avoid shifts by -1 in tszimm_shr() and tszimm_shl(), Michael Tokarev, 2024/09/06
- [Stable-8.2.7 23/53] target/arm: Ignore SMCR_EL2.LEN and SVCR_EL2.LEN if EL2 is not enabled, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 24/53] docs/sphinx/depfile.py: Handle env.doc2path() returning a Path not a str, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 25/53] hw/i386/amd_iommu: Don't leak memory in amdvi_update_iotlb(), Michael Tokarev, 2024/09/06
- [Stable-8.2.7 26/53] hw/arm/mps2-tz.c: fix RX/TX interrupts order, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 28/53] virtio-net: Ensure queue index fits with RSS, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 29/53] virtio-net: Fix network stall at the host side waiting for kick, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 31/53] hw/sd/sdhci: Reset @data_count index on invalid ADMA transfers, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 27/53] target/arm: Handle denormals correctly for FMOPA (widening),
Michael Tokarev <=
- [Stable-8.2.7 32/53] vvfat: Fix bug in writing to middle of file, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 34/53] vvfat: Fix wrong checks for cluster mappings invariant, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 30/53] target/i386: Fix VSIB decode, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 33/53] vvfat: Fix usage of `info.file.offset`, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 37/53] nbd/server: Plumb in new args to nbd_client_add(), Michael Tokarev, 2024/09/06
- [Stable-8.2.7 36/53] iotests: Add `vvfat` tests, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 38/53] nbd/server: CVE-2024-7409: Cap default max-connections to 100, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 35/53] vvfat: Fix reading files with non-continuous clusters, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 44/53] target/i386: Do not apply REX to MMX operands, Michael Tokarev, 2024/09/06
- [Stable-8.2.7 40/53] nbd/server: CVE-2024-7409: Close stray clients at server-stop, Michael Tokarev, 2024/09/06