qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC PATCH v2 1/3] target-arm: Use Neon for zero checki


From: Peter Maydell
Subject: Re: [Qemu-devel] [RFC PATCH v2 1/3] target-arm: Use Neon for zero checking
Date: Thu, 7 Apr 2016 11:44:05 +0100

On 7 April 2016 at 11:30, Paolo Bonzini <address@hidden> wrote:
>
>> +#elif defined __aarch64__
>> +#include "arm_neon.h"
>> +
>> +#define NEON_VECTYPE               uint64x2_t
>> +#define NEON_LOAD_N_ORR(v1, v2)    (vld1q_u64(&v1) | vld1q_u64(&v2))
>
> Why is the load and orr necessary?  Is ((v1) | (v2)) enough?
>
>> +#define NEON_ORR(v1, v2)           ((v1) | (v2))
>> +#define NEON_NOT_EQ_ZERO(v1) \
>> +        ((vgetq_lane_u64(v1, 0) != 0) || (vgetq_lane_u64(v1, 1) != 0))
>> +
>> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR_NEON 16
>
> Unless you have numbers saying that a 16-unroll is better than an 8-unroll
> (and then you should put those in the commit message), you do not need to
> duplicate code, just add aarch64 definitions for the existing code.

This pure-neon code is also not doing the initial short-loop to
test for non-zero buffers, which means it's not an apples-to-apples
comparison. It seems unlikely that workload balances are going
to be different on ARM vs x86 such that it's worth doing the
small loop on one but not the other. (This is also why it's helpful
to explain your benchmarking method -- the short loop will slow
things down for some cases like "large and untouched RAM", but be
faster again for cases like "large RAM of which most pages have
been dirtied".)

thanks
-- PMM



reply via email to

[Prev in Thread] Current Thread [Next in Thread]