:PROPERTIES:
:ID: 2E64F482-C786-4BF9-9B01-753298228CDF
:END:
#+title: SIMD

* Hand-coding in C

SIMD instructions can be used in C via C intrinsics, i.e. types and functions operating on these types.

** ARM NEON

#+begin_quote
ARM NEON is a Single Instruction Multiple Data (SIMD) extension in the ARM architecture, designed to accelerate multimedia and signal processing applications. It's a core part of the ARMv7 and later architectures, providing parallel data manipulation capabilities on a single processing core.
#+end_quote

See also [[https://developer.arm.com/Architectures/Neon][ARM NEON]].

NEON provides 32 registers of 128 bits.

The NEON intrinsics are described in [[https://arm-software.github.io/acle/neon_intrinsics/advsimd.html][NEON Intrinsics]]. For LLVM, see also this [[https://blog.llvm.org/2010/04/arm-advanced-simd-neon-intrinsics-and.html][LLVM Blog]] post.

It doesn't look like rocket science: there are certain types defined, like =int32x4_t=, and a large number of intrinsics for loading, adding, and whatever else, using these types. See also =arm_neon.h=. AFAIU, =__ARM_NEON__= is defined at compile time when NEON is available, but I'm not sure.

Finding UTF-8 character starts would work like this:

- Load 16 bytes.
- AND the 16 bytes with a mask of 16 =0xC0= bytes in parallel.
- Compare the 16 results with =0x80= in parallel in the same way.
- Find the bytes which compare true; these are continuation bytes, and all other bytes are character starts.
- Move on to the next 16 bytes.

** Portability

An implementation in one set of intrinsics can be translated to another set of intrinsics using [[https://github.com/simd-everywhere/simde][SIMDe]]. For example, if one has an implementation using NEON intrinsics, one can use SIMDe to get an AVX implementation. It probably won't be as efficient as a native AVX implementation, but it's a start and maybe good enough.

#+begin_quote
The SIMDe header-only library provides fast, portable implementations of SIMD intrinsics on hardware which doesn't natively support them, such as calling SSE functions on ARM.
There is no performance penalty if the hardware supports the native implementation (e.g., SSE/AVX runs at full speed on x86, NEON on ARM, etc.).
#+end_quote

MIT license, written in C.

* Compiler Auto-Vectorization

Probably the most promising route for the future, but currently not easy to get working, and probably not easy to ensure that it keeps working across different compiler versions.

** Clang 20

Documentation: [[https://llvm.org/docs/Vectorizers.html][Auto-Vectorization]].

Various places say it's not as good as writing the code by hand (unsurprising), and sometimes doesn't work well. Others say that with a little help, like the use of =restrict= in C and =#pragma clang loop ...=, it works well enough.

Auto-vectorization is on by default with =-O2= or higher, but requires something like =-march=native= to make the host's SIMD instructions available.

*** Example, not vectorized

#+begin_src c
size_t scan (const char *s, size_t n) {
  size_t i;
  for (i = 0; i < n; ++i) {
    char c = s[i];
    if ((c & 0xc0) == 0x80)
      break;
  }
  return i;
}
#+end_src

Compiled with =clang -S -O2 neon.c= this gives the non-vectorized assembler code

#+begin_src asm
    .build_version macos, 15, 0
    .section __TEXT,__text,regular,pure_instructions
    .globl _scan                        ; -- Begin function scan
    .p2align 2
_scan:                                  ; @scan
    .cfi_startproc
; %bb.0:
    cbz x1, LBB0_6
; %bb.1:
    mov x8, x0
    mov x0, #0                          ; =0x0
LBB0_2:                                 ; =>This Inner Loop Header: Depth=1
    ldrsb w9, [x8, x0]
    cmn w9, #64
    b.lt LBB0_5
; %bb.3:                                ; in Loop: Header=BB0_2 Depth=1
    add x0, x0, #1
    cmp x1, x0
    b.ne LBB0_2
; %bb.4:
    mov x0, x1
LBB0_5:
    ret
LBB0_6:
    mov x0, #0                          ; =0x0
    ret
    .cfi_endproc
                                        ; -- End function
.subsections_via_symbols
#+end_src

=clang -S -O2 -march=native neon.c= tries to vectorize but fails, which can be seen with =-Rpass-analysis=loop-vectorize=:

#+begin_src sh
clang -O2 -march=native -Rpass-analysis=loop-vectorize -S neon.c
neon.c:23:5: remark: loop not vectorized: Cannot vectorize potentially faulting early exit loop [-Rpass-analysis=loop-vectorize]
   23 |     for (i = 0; i < n; ++i)
      |     ^
#+end_src

Other cases fail to vectorize as well.
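Since the vectorizer cannot handle the faulting early exit, this kind of loop is a natural candidate for hand-written intrinsics following the steps listed under ARM NEON above. A minimal sketch, untested on real hardware; the function name =scan_neon= and the scalar fallback are my own, and =vmaxvq_u8= assumes an AArch64 target:

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Return the index of the first UTF-8 continuation byte
   ((b & 0xC0) == 0x80) in the first n bytes of s, or n if there
   is none, processing 16 bytes per iteration. */
size_t scan_neon (const char *s, size_t n) {
  size_t i = 0;
  uint8x16_t mask = vdupq_n_u8 (0xC0);  /* 16 copies of 0xC0 */
  uint8x16_t cont = vdupq_n_u8 (0x80);  /* 16 copies of 0x80 */
  for (; i + 16 <= n; i += 16) {
    uint8x16_t v = vld1q_u8 ((const uint8_t *) s + i);
    /* lanes become 0xFF where (byte & 0xC0) == 0x80 */
    uint8x16_t eq = vceqq_u8 (vandq_u8 (v, mask), cont);
    if (vmaxvq_u8 (eq))  /* any lane set? (vmaxvq_u8 is AArch64-only) */
      break;             /* let the scalar loop find the exact index */
  }
  /* scalar loop for the remainder and for locating the hit */
  for (; i < n; ++i)
    if (((unsigned char) s[i] & 0xC0) == 0x80)
      break;
  return i;
}
#else
/* Scalar fallback when NEON is not available. */
size_t scan_neon (const char *s, size_t n) {
  size_t i;
  for (i = 0; i < n; ++i)
    if (((unsigned char) s[i] & 0xC0) == 0x80)
      break;
  return i;
}
#endif
```

The scalar tail doubles as the "find the exact byte" step after the vector compare reports a hit, which keeps the vector loop itself branch-light.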
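To illustrate the =restrict= hint mentioned above, here is a sketch of a loop that Clang can vectorize once aliasing is ruled out; the function =saxpy= is my own illustration, not from the Clang docs. With both pointers declared =restrict=, the compiler may assume they don't overlap, and the pragma makes the intent explicit:

```c
#include <stddef.h>

/* y[i] += a * x[i].  The restrict qualifiers promise that x and y
   do not alias, so the compiler needs no runtime overlap check. */
void saxpy (float *restrict y, const float *restrict x, float a, size_t n) {
#pragma clang loop vectorize(enable)
  for (size_t i = 0; i < n; ++i)
    y[i] += a * x[i];
}
```

Whether this actually vectorizes can again be checked with =-Rpass=loop-vectorize=, which reports successful vectorizations.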
*** Example, vectorized

Here is an example that actually is vectorized.

#+begin_src c
size_t scan (const char *s, size_t n) {
  size_t i = 0, j = 0;
  while (i < n) {
    char c = s[i];
    if ((c & 0xc0) == 0x80)
      ++j;
    ++i;
  }
  return j;
}
#+end_src

Compiled with =clang -O2 -march=native -Rpass-analysis=loop-vectorize -S neon.c= this gives

#+begin_src asm
    .build_version macos, 15, 0
    .section __TEXT,__text,regular,pure_instructions
    .globl _scan                        ; -- Begin function scan
    .p2align 2
_scan:                                  ; @scan
    .cfi_startproc
; %bb.0:
    cbz x1, LBB0_3
; %bb.1:
    cmp x1, #8
    b.hs LBB0_4
; %bb.2:
    mov x9, #0                          ; =0x0
    mov x8, #0                          ; =0x0
    b LBB0_13
LBB0_3:
    mov x8, #0                          ; =0x0
    mov x0, x8
    ret
LBB0_4:
    cmp x1, #32
    b.hs LBB0_6
; %bb.5:
    mov x9, #0                          ; =0x0
    mov x8, #0                          ; =0x0
    b LBB0_10
LBB0_6:
    movi.2d v0, #0000000000000000
    movi.16b v1, #192
    mov w8, #1                          ; =0x1
    dup.2d v2, x8
    and x9, x1, #0xffffffffffffffe0
    movi.2d v3, #0000000000000000
    add x8, x0, #16
    movi.2d v4, #0000000000000000
    mov x10, x9
    movi.2d v17, #0000000000000000
    movi.2d v5, #0000000000000000
    movi.2d v7, #0000000000000000
    movi.2d v6, #0000000000000000
    movi.2d v19, #0000000000000000
    movi.2d v16, #0000000000000000
    movi.2d v21, #0000000000000000
    movi.2d v20, #0000000000000000
    movi.2d v24, #0000000000000000
    movi.2d v18, #0000000000000000
    movi.2d v23, #0000000000000000
    movi.2d v22, #0000000000000000
    movi.2d v25, #0000000000000000
LBB0_7:                                 ; =>This Inner Loop Header: Depth=1
    ldp q27, q26, [x8, #-16]
    cmgt.16b v28, v1, v27
    ushll.8h v27, v28, #0
    ushll2.8h v28, v28, #0
    ushll2.4s v29, v28, #0
    ushll2.2d v30, v29, #0
    and.16b v30, v30, v2
    add.2d v19, v19, v30
    ushll2.4s v30, v27, #0
    ushll.4s v28, v28, #0
    ushll.2d v29, v29, #0
    and.16b v29, v29, v2
    add.2d v6, v6, v29
    ushll2.2d v29, v28, #0
    and.16b v29, v29, v2
    add.2d v7, v7, v29
    ushll2.2d v29, v30, #0
    and.16b v29, v29, v2
    add.2d v17, v17, v29
    ushll.4s v27, v27, #0
    ushll.2d v28, v28, #0
    and.16b v28, v28, v2
    add.2d v5, v5, v28
    ushll.2d v28, v27, #0
    and.16b v28, v28, v2
    ushll2.2d v27, v27, #0
    and.16b v27, v27, v2
    ushll.2d v29, v30, #0
    and.16b v29, v29, v2
    cmgt.16b v26, v1, v26
    add.2d v4, v4, v29
    ushll.8h v29, v26, #0
    ushll2.8h v26, v26, #0
    add.2d v3, v3, v27
    ushll2.4s v27, v26, #0
    add.2d v0, v0, v28
    ushll2.2d v28, v27, #0
    and.16b v28, v28, v2
    add.2d v25, v25, v28
    ushll2.4s v28, v29, #0
    ushll.4s v26, v26, #0
    ushll.2d v27, v27, #0
    and.16b v27, v27, v2
    add.2d v22, v22, v27
    ushll2.2d v27, v26, #0
    and.16b v27, v27, v2
    add.2d v23, v23, v27
    ushll2.2d v27, v28, #0
    and.16b v27, v27, v2
    add.2d v24, v24, v27
    ushll.2d v26, v26, #0
    and.16b v26, v26, v2
    add.2d v18, v18, v26
    ushll.4s v26, v29, #0
    ushll.2d v27, v28, #0
    and.16b v27, v27, v2
    add.2d v20, v20, v27
    ushll2.2d v27, v26, #0
    and.16b v27, v27, v2
    add.2d v21, v21, v27
    ushll.2d v26, v26, #0
    and.16b v26, v26, v2
    add.2d v16, v16, v26
    add x8, x8, #32
    subs x10, x10, #32
    b.ne LBB0_7
; %bb.8:
    add.2d v1, v24, v17
    add.2d v2, v25, v19
    add.2d v3, v21, v3
    add.2d v7, v23, v7
    add.2d v4, v20, v4
    add.2d v6, v22, v6
    add.2d v0, v16, v0
    add.2d v5, v18, v5
    add.2d v0, v0, v5
    add.2d v4, v4, v6
    add.2d v0, v0, v4
    add.2d v3, v3, v7
    add.2d v1, v1, v2
    add.2d v1, v3, v1
    add.2d v0, v0, v1
    addp.2d d0, v0
    fmov x8, d0
    cmp x1, x9
    b.eq LBB0_15
; %bb.9:
    tst x1, #0x18
    b.eq LBB0_13
LBB0_10:
    mov x10, x9
    and x9, x1, #0xfffffffffffffff8
    movi.2d v0, #0000000000000000
    movi.2d v1, #0000000000000000
    mov.d v1[0], x8
    add x8, x0, x10
    sub x10, x10, x9
    movi.8b v2, #192
    mov w11, #1                         ; =0x1
    dup.2d v3, x11
    movi.2d v4, #0000000000000000
    movi.2d v5, #0000000000000000
LBB0_11:                                ; =>This Inner Loop Header: Depth=1
    ldr d6, [x8], #8
    cmgt.8b v6, v2, v6
    ushll.8h v6, v6, #0
    ushll.4s v7, v6, #0
    ushll.2d v16, v7, #0
    and.16b v16, v16, v3
    ushll2.2d v7, v7, #0
    and.16b v7, v7, v3
    ushll2.4s v6, v6, #0
    ushll.2d v17, v6, #0
    and.16b v17, v17, v3
    ushll2.2d v6, v6, #0
    and.16b v6, v6, v3
    add.2d v5, v5, v6
    add.2d v4, v4, v17
    add.2d v0, v0, v7
    add.2d v1, v1, v16
    adds x10, x10, #8
    b.ne LBB0_11
; %bb.12:
    add.2d v1, v1, v4
    add.2d v0, v0, v5
    add.2d v0, v1, v0
    addp.2d d0, v0
    fmov x8, d0
    cmp x1, x9
    b.eq LBB0_15
LBB0_13:
    sub x10, x1, x9
    add x9, x0, x9
LBB0_14:                                ; =>This Inner Loop Header: Depth=1
    ldrsb w11, [x9], #1
    cmn w11, #64
    cinc x8, x8, lt
    subs x10, x10, #1
    b.ne LBB0_14
LBB0_15:
    mov x0, x8
    ret
    .cfi_endproc
                                        ; -- End function
.subsections_via_symbols
#+end_src

** GCC

Not looked at, because Clang is the only compiler I can use on macOS.