qemu-devel

Re: [Qemu-devel] [PATCH v5 33/35] target/arm: Implement SVE dot product (indexed)


From: Peter Maydell
Subject: Re: [Qemu-devel] [PATCH v5 33/35] target/arm: Implement SVE dot product (indexed)
Date: Tue, 26 Jun 2018 17:30:02 +0100

On 26 June 2018 at 17:17, Richard Henderson
<address@hidden> wrote:
> On 06/26/2018 08:30 AM, Peter Maydell wrote:
>> On 21 June 2018 at 02:53, Richard Henderson
>> <address@hidden> wrote:
>>> Signed-off-by: Richard Henderson <address@hidden>
>>> ---
>>>  target/arm/helper.h        |  5 ++
>>>  target/arm/translate-sve.c | 18 +++++++
>>>  target/arm/vec_helper.c    | 96 ++++++++++++++++++++++++++++++++++++++
>>>  target/arm/sve.decode      |  8 +++-
>>>  4 files changed, 126 insertions(+), 1 deletion(-)
>>>
>>
>>> +void HELPER(gvec_sdot_idx_b)(void *vd, void *vn, void *vm, uint32_t desc)
>>> +{
>>> +    intptr_t i, j, opr_sz = simd_oprsz(desc), opr_sz_4 = opr_sz / 4;
>>> +    intptr_t index = simd_data(desc);
>>> +    uint32_t *d = vd;
>>> +    int8_t *n = vn, *m = vm;
>>> +
>>> +    for (i = 0; i < opr_sz_4; i = j) {
>>> +        int8_t m0 = m[(i + index) * 4 + 0];
>>> +        int8_t m1 = m[(i + index) * 4 + 1];
>>> +        int8_t m2 = m[(i + index) * 4 + 2];
>>> +        int8_t m3 = m[(i + index) * 4 + 3];
>>> +
>>> +        j = i;
>>> +        do {
>>> +            d[j] += n[j * 4 + 0] * m0
>>> +                  + n[j * 4 + 1] * m1
>>> +                  + n[j * 4 + 2] * m2
>>> +                  + n[j * 4 + 3] * m3;
>>> +        } while (++j < MIN(i + 4, opr_sz_4));
>>> +    }
>>> +    clear_tail(d, opr_sz, simd_maxsz(desc));
>>> +}
>>
>> Maybe I'm just half asleep this afternoon, but this is pretty
>> confusing -- nested loops where the outer loop's increment
>> uses the inner loop's index, and the inner loop's conditions
>> depend on the outer loop index...
>
> Yeah, well.
>
> There is an edge case of aa64 advsimd, reusing this same helper,
>
>         sdot    v0.2s, v1.8b, v0.4b[0]
>
> where m values must be read (and held) before writing d results,
> and there are not 16/4=4 elements to process but only 2.
>
> I suppose I could special-case oprsz == 8 in order to simplify
> iteration of what is otherwise a multiple of 16.
>
> I thought iterating J from I to I+4 was easier to read than
> writing out I+J everywhere.  Perhaps not.

Mmm. I did indeed fail to notice the symmetry between the
indexes into m[] and those into n[].
The other bit that threw me is that the outer loop over i
advances by assigning from j (i = j) rather than by an
explicit increment.

A comment describing the intent might help?
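
To illustrate the kind of comment meant, here is a rough standalone
sketch of the same iteration with the intent written out. It is not
the QEMU helper: the name sdot_idx_b_sketch and its explicit opr_sz
and index parameters are assumptions standing in for HELPER(),
simd_oprsz() and simd_data(), and the clear_tail() step is left out.
It also uses a separate segment start and element index instead of
the i = j trick, which is only one possible way to write it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Indexed signed dot product, standalone sketch (hypothetical name).
 *
 * opr_sz is the operand size in bytes.  Each 32-bit element of d
 * accumulates the dot product of four int8 values from n with four
 * int8 values from m.  Elements are handled in 128-bit segments of
 * four; every element in a segment uses the same group of four m
 * bytes, selected by index (0..3) within that segment.
 */
static void sdot_idx_b_sketch(uint32_t *d, const int8_t *n,
                              const int8_t *m, size_t opr_sz,
                              size_t index)
{
    size_t opr_sz_4 = opr_sz / 4;   /* number of 32-bit elements */
    size_t seg, e;

    assert(opr_sz % 4 == 0 && index < 4);

    for (seg = 0; seg < opr_sz_4; seg += 4) {
        /*
         * Read and hold the m values before writing any d result:
         * d may overlap m, as in sdot v0.2s, v1.8b, v0.4b[0].
         */
        int8_t m0 = m[(seg + index) * 4 + 0];
        int8_t m1 = m[(seg + index) * 4 + 1];
        int8_t m2 = m[(seg + index) * 4 + 2];
        int8_t m3 = m[(seg + index) * 4 + 3];

        /* An 8-byte operand has only 2 elements, not a full 4. */
        size_t end = seg + 4 < opr_sz_4 ? seg + 4 : opr_sz_4;

        for (e = seg; e < end; e++) {
            d[e] += n[e * 4 + 0] * m0
                  + n[e * 4 + 1] * m1
                  + n[e * 4 + 2] * m2
                  + n[e * 4 + 3] * m3;
        }
    }
}
```

The clamp on end covers the oprsz == 8 case Richard mentions, so no
separate special case is needed in this form.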

thanks
-- PMM


