I thought I'd see if I could speed up PNG loading by vectorizing alpha premultiplication, and it does give a nice speedup:
commit d7d592b0acb25ad8084b1d60459dd40bfd9c3356 (HEAD -> png-faster, github/png-faster)
Author: Behdad Esfahbod <address@hidden>
Date: Tue Aug 8 21:29:25 2017 -0700
Process four pixels at a time in premultiply_data() PNG function
Load/store using memcpy(). Now this is finally faster than the non-vectorized
code. The premultiply_data() overhead is reduced by 60%.
$ ftbench -b a ~/.fonts/NotoColorEmoji.ttf
Without premultiply_data: 155 us/op
With 4-pixel vectorization: 167 us/op <---------
Without vectorization: 182 us/op
The code is rather terse but readable; I can add comments. It still needs some GCC/clang checks, and the big-endian case needs to be either implemented or disabled. I couldn't find any endianness macros in FreeType.
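For anyone who wants the general idea without reading the commit, here is a minimal standalone sketch of the technique. It is not the code from the commit. The function names (mul_alpha, premultiply_scalar, premultiply_rgba_sketch) are made up for illustration, and the sketch only premultiplies in place, leaving the channel order untouched, which may differ from what premultiply_data() actually does. It assumes the GCC/clang vector extensions and the compiler-provided __BYTE_ORDER__ macro, and falls back to the scalar loop when either is missing or the target is not little-endian.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Division-free rounding of alpha * color / 255. */
static uint8_t mul_alpha(unsigned alpha, unsigned color)
{
    unsigned t = alpha * color + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Scalar reference: premultiply RGBA pixels in place; alpha is unchanged. */
static void premultiply_scalar(uint8_t *p, size_t nbytes)
{
    for (size_t i = 0; i + 4 <= nbytes; i += 4) {
        unsigned a = p[i + 3];
        p[i + 0] = mul_alpha(a, p[i + 0]);
        p[i + 1] = mul_alpha(a, p[i + 1]);
        p[i + 2] = mul_alpha(a, p[i + 2]);
    }
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__

typedef uint16_t v8u16 __attribute__((vector_size(16)));
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Hypothetical name. Processes four RGBA pixels (16 bytes) per iteration. */
static void premultiply_rgba_sketch(uint8_t *p, size_t nbytes)
{
    /* Explicit vector constants, so no scalar-to-vector promotion is needed. */
    const v8u16 c80 = {0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80};
    const v8u16 cFF = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};
    const v8u16 c8  = {8, 8, 8, 8, 8, 8, 8, 8};
    const v8u16 aFF = {0, 0xFF, 0, 0xFF, 0, 0xFF, 0, 0xFF};
    const v4u32 c16 = {16, 16, 16, 16};
    const v4u32 c24 = {24, 24, 24, 24};
    size_t i = 0;

    for (; i + 16 <= nbytes; i += 16) {
        v8u16 s, lo, hi, a;
        v4u32 w;

        /* memcpy() load: no alignment or aliasing assumptions. */
        memcpy(&s, p + i, 16);           /* 16-bit lanes: GR AB GR AB ... */
        memcpy(&w, p + i, 16);           /* 32-bit lanes: one pixel each  */

        w = w >> c24;                    /* alpha in bits 0-7 of each pixel */
        w = w | (w << c16);              /* copy alpha into both halves     */
        memcpy(&a, &w, 16);              /* 16-bit lanes: A A  A A ...      */

        lo = s & cFF;                    /* R B  R B ... */
        hi = (s >> c8) | aFF;            /* G FF G FF ... (FF keeps alpha)  */

        /* Same rounding as mul_alpha(), on eight lanes at once. */
        lo = lo * a + c80;
        hi = hi * a + c80;
        lo = (lo + (lo >> c8)) >> c8;
        hi = (hi + (hi >> c8)) >> c8;

        s = lo | (hi << c8);
        memcpy(p + i, &s, 16);           /* memcpy() store */
    }

    /* Leftover pixels (fewer than four) go through the scalar loop. */
    premultiply_scalar(p + i, nbytes - i);
}

#else

/* Big-endian or unknown compiler: scalar fallback only. */
static void premultiply_rgba_sketch(uint8_t *p, size_t nbytes)
{
    premultiply_scalar(p, nbytes);
}

#endif
```

The byte-lane tricks (masking with 0xFF, shifting by 8, pulling alpha from bits 24-31) are what tie the vector path to little-endian, so guarding on __BYTE_ORDER__ is one way to disable it elsewhere. Whether that macro is acceptable for FreeType's supported compilers is an open question.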