Re: Vectorization, SIMD
From: Gerd Möllmann
Subject: Re: Vectorization, SIMD
Date: Sat, 26 Apr 2025 11:33:27 +0200
User-agent: Gnus/5.13 (Gnus v5.13)
Gerd Möllmann <gerd.moellmann@gmail.com> writes:
> Hi Stef,
>
> I've played with this a bit and made notes, which I'd like to share.
> Please find attached.
>
> My summary so far: Can be done and it's not that difficult. There is
> even some degree of "portability" achievable when writing SIMD code by
> hand. What will really be no fun, from my POV, is configuration stuff,
> making it work on N platforms with M compilers and so on.
>
> Don't know, maybe it's best to wait for compilers to get better at
> auto-vectorization. Or maybe GCC is better than LLVM in this regard. I
> haven't checked that because I can't really use GCC here on macOS.
FWIW, this is what it could look like in C using ARM Neon.
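// clang -S neon.c
#include <arm_neon.h>
#include <stdint.h>

int count_char_heads(const char *s, int len) {
  // Create vectors of constants 16x 0xC0 and 0x80.
  uint8x16_t v0xC0 = vmovq_n_u8(0xC0);
  uint8x16_t v0x80 = vmovq_n_u8(0x80);
  const uint8_t *p = (const uint8_t *) s;
  int trailing = 0;
  int i = 0;
  // Process full 16-byte blocks only, so we never read past s + len.
  for (; len - i >= 16; i += 16) {
    // AND each byte with 0xC0 (vld1q_u8 does not require alignment).
    uint8x16_t a_and_c0 = vandq_u8(vld1q_u8(p + i), v0xC0);
    // Compare each byte with 0x80. Matching lanes become 0xFF;
    // these are the bytes for which CHAR_HEAD_P is false.
    uint8x16_t eq = vceqq_u8(a_and_c0, v0x80);
    // Reduce vector. Shift each 0xFF down to 1 and add up the lanes
    // to get the number of non-head bytes in this block. This could
    // be further improved by accumulating in a vector and reducing
    // only before a lane could overflow (at most every 255 blocks).
    trailing += vaddlvq_u8(vshrq_n_u8(eq, 7));
  }
  // Handle the remaining len % 16 bytes one at a time.
  for (; i < len; ++i)
    trailing += (p[i] & 0xC0) == 0x80;
  // Character heads are all bytes that are not trailing bytes.
  return len - trailing;
}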
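The comment in the function above mentions accumulating in a vector and only reducing before a lane could overflow. A minimal sketch of that idea might look like the following. The name count_char_heads_acc, the 255-block batch size, and the scalar fallback for non-AArch64 builds are assumptions made for illustration, not something taken from Emacs.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical variant: count bytes that are not UTF-8 continuation
   bytes (10xxxxxx), reducing the vector only once per batch.  */
size_t count_char_heads_acc(const uint8_t *s, size_t len) {
  size_t trailing = 0; /* continuation bytes seen so far */
  size_t i = 0;
#if defined(__aarch64__)
  const uint8x16_t v0xC0 = vdupq_n_u8(0xC0);
  const uint8x16_t v0x80 = vdupq_n_u8(0x80);
  while (len - i >= 16) {
    /* Each lane grows by at most 1 per block, so a batch of up to
       255 blocks cannot overflow an 8-bit lane.  */
    uint8x16_t acc = vdupq_n_u8(0);
    for (int n = 0; n < 255 && len - i >= 16; ++n, i += 16) {
      uint8x16_t eq = vceqq_u8(vandq_u8(vld1q_u8(s + i), v0xC0), v0x80);
      /* Matching lanes are 0xFF, i.e. -1 modulo 256, so subtracting
         them adds 1 to the corresponding accumulator lane.  */
      acc = vsubq_u8(acc, eq);
    }
    /* Widening horizontal add: at most 16 * 255 = 4080.  */
    trailing += vaddlvq_u8(acc);
  }
#endif
  /* Scalar tail, and the whole input when Neon is not available.  */
  for (; i < len; ++i)
    trailing += (s[i] & 0xC0) == 0x80;
  return len - trailing;
}
```

Whether this is actually faster than reducing every block would have to be measured; the sketch only shows the shape of the change.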
In Emacs, one would of course write functions around such a function
that take the gap into account and so on.
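As a rough illustration of what "taking the gap into account" could mean, here is a sketch that splits a byte range into the part before the gap and the part after it, and counts each part separately. The struct gap_text, its fields, and both function names are hypothetical stand-ins, not Emacs's actual buffer representation, and a plain scalar loop stands in for the SIMD counter so the example is self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical gap buffer: text bytes live in base[0 .. gap_start)
   and base[gap_start + gap_size ..).  Not Emacs's real struct.  */
struct gap_text {
  const uint8_t *base; /* start of the storage */
  size_t gap_start;    /* byte offset at which the gap begins */
  size_t gap_size;     /* number of unused bytes in the gap */
};

/* Scalar stand-in for a vectorized counter over contiguous bytes.  */
static size_t count_heads_span(const uint8_t *s, size_t len) {
  size_t heads = 0;
  for (size_t i = 0; i < len; ++i)
    heads += (s[i] & 0xC0) != 0x80;
  return heads;
}

/* Count character heads in the text byte range [from, to), where
   offsets ignore the gap.  Assumes from <= to.  */
size_t count_heads_gap(const struct gap_text *t, size_t from, size_t to) {
  size_t heads = 0;
  if (from < t->gap_start) {
    /* Part of the range that lies before the gap.  */
    size_t stop = to < t->gap_start ? to : t->gap_start;
    heads += count_heads_span(t->base + from, stop - from);
    from = stop;
  }
  if (from < to) {
    /* Rest of the range lies after the gap; skip over it.  */
    heads += count_heads_span(t->base + t->gap_size + from, to - from);
  }
  return heads;
}
```

The contiguous counter never sees the gap, so it could be replaced by a Neon version without touching the wrapper.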
Not difficult, but I don't think I'll do that, since we don't have a
problem :-).