Hi,
I implemented another improvement to cksum that speeds it up some more. The
x86 pclmul (carry-less multiplication) hardware instruction can be used for
the CRC32 calculation. The patch detects support for it at run time using
CPUID, and falls back to the slice-by-8 algorithm if it is not available. I
also added detection in autoconf, so the pclmul code is only compiled on
supported targets.
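For readers who want to picture the run-time detection, here is a minimal
sketch. It is not the code from the patch. It assumes GCC or Clang, whose
__builtin_cpu_supports() wraps the CPUID query, and the names crc_impl,
have_pclmul and choose_crc_impl are made up for this illustration.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum crc_impl { CRC_IMPL_SLICE8, CRC_IMPL_PCLMUL };

/* Returns true if the running CPU reports pclmul support.
   Assumes GCC/Clang builtins; on other compilers or non-x86 targets
   it reports "not supported" so the caller uses the fallback. */
static bool have_pclmul(void)
{
#if (defined(__GNUC__) || defined(__clang__)) \
    && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                       /* populate CPU feature data */
    return __builtin_cpu_supports("pclmul") != 0;
#else
    return false;
#endif
}

/* Pick the fast path when available, otherwise slice-by-8. */
enum crc_impl choose_crc_impl(void)
{
    return have_pclmul() ? CRC_IMPL_PCLMUL : CRC_IMPL_SLICE8;
}
```

A caller would run choose_crc_impl() once at startup and then branch on the
result (or store a function pointer), so the check is not repeated per block.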
In my testing, the checksum calculation itself is about 6x faster than with
the slice-by-8 algorithm (looking at user time). However, since the time the
process spends waiting on syscalls (fread) stays the same, the actual
real-time speedup is only about 3x. It would be an interesting exercise to try
async IO, so that one block could be checksummed while the next is being
read. Maybe I will try that one day.
As a side note, x86 also has a crc32 hardware instruction, but it uses a
different polynomial than cksum does, so it cannot be used here.
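To make the polynomial point concrete: cksum uses the generator polynomial
0x04C11DB7, while the x86 crc32 instruction implements CRC-32C (polynomial
0x1EDC6F41), so the two produce different checksums for the same data. Below
is a small bit-by-bit reference sketch of the POSIX cksum algorithm. It is
written for clarity, is not taken from coreutils, and the function names are
made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define CKSUM_POLY 0x04C11DB7u  /* generator polynomial used by POSIX cksum */

/* Feed one byte into the CRC, most significant bit first. */
static uint32_t crc_update_byte(uint32_t crc, uint8_t byte)
{
    crc ^= (uint32_t)byte << 24;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc & 0x80000000u) ? (crc << 1) ^ CKSUM_POLY : crc << 1;
    return crc;
}

/* POSIX cksum: CRC over the data, then over the length (least
   significant byte first, as many bytes as needed), then complemented. */
uint32_t posix_cksum(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++)
        crc = crc_update_byte(crc, buf[i]);
    for (size_t n = len; n != 0; n >>= 8)
        crc = crc_update_byte(crc, (uint8_t)(n & 0xFF));
    return ~crc;
}
```

As a sanity check, posix_cksum() over the 9 bytes "123456789" gives 930766865,
which is what `printf 123456789 | cksum` prints.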
Some benchmarks, with the file already in the file cache.
Oldest version (byte by byte):
ztion@rita:~/coreutils/coreutils-8.32/src$ time ./cksum /disk2/download/bigfile2G
real 0m7,311s
user 0m7,039s
sys 0m0,262s
Slice-by-8 version:
ztion@rita:~/coreutils/coreutils-8.32/src$ time ./cksum.slice /disk2/download/bigfile2G
real 0m1,546s
user 0m1,267s
sys 0m0,247s
pclmul version:
ztion@rita:~/coreutils/coreutils_fork/src$ time ./cksum /disk2/download/bigfile2G
real 0m0,462s
user 0m0,191s
sys 0m0,271s
The patch is at:
https://github.com/coreutils/coreutils/pull/48