qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions


From: Richard Henderson
Subject: Re: [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions
Date: Wed, 27 Apr 2022 15:28:07 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0

On 4/27/22 13:24, Lucas Mateus Martins Araujo e Castro wrote:

On 26/04/2022 20:40, Richard Henderson wrote:

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
+%xx_at          23:3 !function=times_4
+@XX3_at         ...... ... .. ..... ..... ........ ... &XX3 xt=%xx_at xb=%xx_xb

Hmm.  Depends, I suppose on whether you want acc[0-7] or vsr[0-28]
I mostly used VSR function here, but since I'll change the patch 1 to your suggestion (which will require creating acc_full_offset) I'll make a few changes to create some functions for the accumulator

+/*
+ * Packed VSX Integer GER Flags
+ * 00 - no accumulation no saturation
+ * 01 - accumulate but no saturation
+ * 10 - no accumulation but with saturation
+ * 11 - accumulate with saturation
+ */
+static inline bool get_sat(uint32_t flags)
+{
+    return flags & 0x2;
+}
+
+static inline bool get_acc(uint32_t flags)
+{
+    return flags & 0x1;
+}

Better to have separate helpers for these?  They'd be immediate operands to the 
function
replacing XVIGER (see below) and thus optimize well.
Do you mean different functions or a function that receives packed_flags along with the callback functions?

I mean separate helper entry points, which use a common function that receives these as separate boolean arguments, along with the callbacks. Use QEMU_FLATTEN on the helper entry points to ensure that everything is inlined and the constant args are optimized.

In this case it'd be necessary to receive 2 xviger_extract functions since XVI8GER4* multiply one value as signed and the other as unsigned (and other integer GER treat both as signed).

Certainly.


An alternative would be to isolate the innermost loop into a different 
function, like:

     typedef int64_t do_ger(int32_t a, int32_t b, int32_t at, int32_t pmsk);

     static int64_t ger_rank4(int32_t a, int32_t b, int32_t at, int32_t mask)
     {
         int64_t psum = 0, i;
         for (i = 0; i < 4; i++, mask >>= 1) {
             if (mask & 1) {
                 psum += (sextract32(a, i * 8, 8)) * (extract32(b, i * 8, 8));
            }
         }
         return psum;
     }

That way we could avoid having 'rank' as a parameter, what do you think?

Reasonable. I certainly like extracting uint32_t from the vector generically and not having to pass that on further.

Why are you passing register numbers instead of pointers, like everywhere else?
Because here we are not working only with 1 register per register number, the ACC uses 4 and the XVF64GER* needs to use XA and XA+1, and while VSR is an array so I could do ppc_vsr_ptr+1 I thought it was better not to access memory I was not given a pointer to, so I passed XA so I can request cpu_vsr_ptr(env, xa) and cpu_vsr_ptr(env, xa + 1)

I think using cpu_vsr_ptr is the mistake.

It might be clarifying to define a ppc_acc_t, if only as a typedef of ppc_vsr_t. The acc_full_offset function will compute the offset for this pointer and, importantly, will be the place to modify if and when the architecture changes to allow or require separate storage for the ACC registers.


r~



reply via email to

[Prev in Thread] Current Thread [Next in Thread]