wc -l AVX code 10%+10% speedup
From: Evgeny Nizhibitsky
Subject: wc -l AVX code 10%+10% speedup
Date: Sat, 30 Mar 2024 14:52:38 +0000
Dear GNU coreutils maintainers,
While playing with the wc -l AVX code, I seem to have found a way to both
speed it up (~10%) and simplify it (13 insertions, 43 deletions), at least
on the files of several million to 1 billion rows that I tested on my CPU.
The change mostly involves using _mm256_movemask_epi8 and __builtin_popcount
instead of the two-accumulator handling, which also allowed me to increase
the buffer size.
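To illustrate the idea, here is a minimal sketch of counting newlines with a compare, a movemask and a popcount. It is not the actual patch and not the coreutils code: the function names (count_newlines, count_newlines_avx2, count_newlines_scalar) are hypothetical, and it assumes GCC or Clang on x86-64 with a runtime check for AVX2.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical scalar fallback: count '\n' bytes one at a time. */
static size_t count_newlines_scalar(const char *buf, size_t len)
{
    size_t lines = 0;
    for (size_t i = 0; i < len; i++)
        lines += (buf[i] == '\n');
    return lines;
}

/* Hypothetical AVX2 path (illustration only, not the coreutils function).
   The target attribute lets this compile without passing -mavx2. */
__attribute__((target("avx2")))
static size_t count_newlines_avx2(const char *buf, size_t len)
{
    const __m256i newline = _mm256_set1_epi8('\n');
    size_t lines = 0;
    size_t i = 0;

    /* Process 32 bytes per iteration. */
    for (; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        /* Each matching byte becomes 0xFF, every other byte 0x00. */
        __m256i eq = _mm256_cmpeq_epi8(chunk, newline);
        /* Collect the top bit of each byte into a 32-bit mask. */
        unsigned int mask = (unsigned int)_mm256_movemask_epi8(eq);
        /* The number of set bits is the number of newlines in the chunk. */
        lines += (size_t)__builtin_popcount(mask);
    }

    /* Handle the remaining tail of fewer than 32 bytes. */
    return lines + count_newlines_scalar(buf + i, len - i);
}

/* Pick the AVX2 path only when the running CPU supports it. */
static size_t count_newlines(const char *buf, size_t len)
{
    if (__builtin_cpu_supports("avx2"))
        return count_newlines_avx2(buf, len);
    return count_newlines_scalar(buf, len);
}
```

Because each chunk is reduced to an ordinary integer right away, there is no vector counter that could overflow and need periodic flushing, which is presumably why the buffer size is no longer constrained by the accumulator handling.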
I also have a further ~10% improvement from using 2 separate threads
instead of 1 to mitigate the usr time overhead, although that is naturally
more complicated.
Whom should I discuss this potential contribution with?
Best wishes,
Evgeny