Re: Prefetches in buffer_zero_*
From: Dr. David Alan Gilbert
Subject: Re: Prefetches in buffer_zero_*
Date: Mon, 26 Jul 2021 09:47:43 +0100
User-agent: Mutt/2.0.7 (2021-05-04)
* Joe Mario (jmario@redhat.com) wrote:
> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> wrote:
>
> > * Richard Henderson (richard.henderson@linaro.org) wrote:
> > > On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> > > > Hi Richard,
> > > > I think you were the last person to fiddle with the prefetching
> > > > in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> > > > prefetching still made sense on modern CPUs, and that their hardware
> > > > generally figures stuff out better on simple increments.
> > > >
> > > > What was your thinking on this, and did you actually measure
> > > > any improvement?
> > >
> > > Ah, well, that was 5 years ago so I have no particular memory of this.
> > > It wouldn't surprise me if you can't measure any improvement on modern
> > > hardware.
> > >
> > > Do you now measure an improvement with the prefetches gone?
> >
> > Not tried; it just came from Joe's suggestion that it was generally a
> > bad idea these days; I do remember that the behaviour of those functions
> > is quite tricky because their performance is VERY data dependent - many
> > VMs actually have pages that are quite dirty so you never iterate the
> > loop, but then you hit others with big zero pages and you spend your
> > entire life in the loop.
> >
> >
> Dave, Richard:
> My curiosity got the best of me. So I created a small test program that
> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
Thanks for testing,
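One caveat on the methodology, given the data dependence I mentioned: an
all-zero buffer only exercises the run-to-completion case. A dirty page
exits the loop almost immediately, so it never benefits from the prefetch
either way - e.g. a sketch against your attached program:

    /* Zero case (what the test measures): the loop runs to
     * completion over the whole 2GB mapping. */
    buffer_zero_avx2(ptr, mmap_len);

    /* Dirty case: one non-zero byte at the start makes the first
     * _mm256_testz_si256() check fail, so buffer_zero_avx2()
     * returns on its first pass through the loop. */
    ptr[0] = 1;
    buffer_zero_avx2(ptr, mmap_len);
    ptr[0] = 0;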
> When I run it on an Intel Cascade Lake processor, the cost of calling
> "__builtin_prefetch(p)" is in the noise range . It's always "just
> slightly" slower. I doubt it could ever be measured in qemu.
>
> Ironically, when I disabled the hardware prefetchers, the program slowed
> down by over 33%. And the call to "__builtin_prefetch(p)" actually hurt
> performance by over 3%.
Yeh that's a bit odd.
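(For anyone else reproducing that: on the Intel parts that document it,
bits 0-3 of MSR 0x1A4 control the four core prefetchers, so msr-tools'
"wrmsr -a 0x1a4 0xf" - or a sketch like the one below, via the msr kernel
module - should flip them off. The bit layout is model-specific, so treat
this as an assumption to check against your CPU's documentation first.)

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        /* 0xf disables all four prefetchers; 0x0 re-enables them. */
        uint64_t val = (argc > 1) ? strtoull(argv[1], NULL, 0) : 0xf;
        int fd = open("/dev/cpu/0/msr", O_RDWR);

        if (fd < 0) {
            perror("open /dev/cpu/0/msr (modprobe msr, run as root)");
            exit(1);
        }
        /* The msr device maps the file offset to the MSR number, so
         * an 8-byte pwrite at offset 0x1a4 is a wrmsr to 0x1a4 on
         * CPU 0; a real run would loop over every /dev/cpu/N/msr. */
        if (pwrite(fd, &val, sizeof(val), 0x1a4) != (ssize_t)sizeof(val)) {
            perror("wrmsr 0x1a4");
            exit(1);
        }
        close(fd);
        return 0;
    }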
> My results are below (only with the hardware prefetchers enabled). The
> program is attached.
> Joe
>
> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> TSC 356144 Kcycles.
> TSC 356714 Kcycles.
> TSC 356707 Kcycles.
> TSC 356565 Kcycles.
> TSC 356853 Kcycles.
> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> TSC 355520 Kcycles.
> TSC 355961 Kcycles.
> TSC 355872 Kcycles.
> TSC 355948 Kcycles.
> TSC 355918 Kcycles.
This basically agrees with the machines I've just tried your test on -
*except* AMD EPYC 7302P's, which really like the prefetch:
[root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
TSC 322162 Kcycles.
TSC 321861 Kcycles.
TSC 322212 Kcycles.
TSC 321957 Kcycles.
TSC 322085 Kcycles.
[root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
TSC 377988 Kcycles.
TSC 380125 Kcycles.
TSC 379440 Kcycles.
TSC 379689 Kcycles.
TSC 379571 Kcycles.
The 1st gen EPYC doesn't seem to see much difference with/without it.
Probably best to leave this code as is!
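(Though if anyone does fancy experimenting further, the more interesting
knob is probably the prefetch distance rather than prefetch-vs-none; a
hypothetical variant of the inner loop, where PF_DIST is a made-up tuning
constant worth sweeping before drawing conclusions:)

    /* Hypothetical: prefetch PF_DIST 32-byte vectors ahead instead
     * of the block we're about to read.  16 vectors = 512 bytes,
     * i.e. 4 loop iterations ahead of p. */
    #define PF_DIST 16
    do {
        __builtin_prefetch(p + PF_DIST);
        if (unlikely(!_mm256_testz_si256(t, t))) {
            return 0;
        }
        t = p[-4] | p[-3] | p[-2] | p[-1];
        p += 4;
    } while (p <= e);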
Dave
> Dave
> > >
> > > r~
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >
> /*
> * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
> *
> * Compile with either:
> * gcc -mavx buffer_zero_avx.c -O
> * or
> * gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH
> */
>
> #include <immintrin.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <stdlib.h>     /* exit() */
> #include <stddef.h>
> #include <sys/types.h>  /* u_int64_t */
> #include <sys/mman.h>
> #include <string.h>
>
> #define likely(x) __builtin_expect((x),1)
> #define unlikely(x) __builtin_expect((x),0)
>
> static __inline__ u_int64_t start_clock();
> static __inline__ u_int64_t stop_clock();
> static int buffer_zero_avx2(const void *buf, size_t len);
>
> /*
> * Allocate a large chunk of anon memory, touch/zero it,
> * and then time the call to buffer_zero_avx2().
> */
> int main()
> {
> size_t mmap_len = 2UL*1024*1024*1024;
> char *ptr = mmap(NULL, mmap_len,
> PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0L);
>
> if (ptr == MAP_FAILED) {
> perror(" mmap");
> exit(1);
> }
>
> // Touch the pages (they're already cleared)
> memset(ptr,0x0,mmap_len);
>
> u_int64_t start_rdtsc = start_clock();
>
> buffer_zero_avx2(ptr, mmap_len);
>
> u_int64_t stop_rdtsc = stop_clock();
> u_int64_t diff = stop_rdtsc - start_rdtsc;
>
> printf("TSC %ld Kcycles. \n", diff/1000);
>
> }
>
> static int
> buffer_zero_avx2(const void *buf, size_t len)
> {
> /* Begin with an unaligned head of 32 bytes. */
> __m256i t = _mm256_loadu_si256(buf);
> __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
> __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
>
> if (likely(p <= e)) {
> /* Loop over 32-byte aligned blocks of 128. */
> do {
> #ifdef DO_PREFETCH
> __builtin_prefetch(p);
> #endif
> if (unlikely(!_mm256_testz_si256(t, t))) {
> printf("In unlikely buffer_zero, p:%lx \n",p);
> return 0;
> }
> t = p[-4] | p[-3] | p[-2] | p[-1];
> p += 4;
> } while (p <= e);
> } else {
> t |= _mm256_loadu_si256(buf + 32);
> if (len <= 128) {
> goto last2;
> }
> }
>
> /* Finish the last block of 128 unaligned. */
> t |= _mm256_loadu_si256(buf + len - 4 * 32);
> t |= _mm256_loadu_si256(buf + len - 3 * 32);
> last2:
> t |= _mm256_loadu_si256(buf + len - 2 * 32);
> t |= _mm256_loadu_si256(buf + len - 1 * 32);
>
> // printf("End of buffer_zero_avx2\n");
> return _mm256_testz_si256(t, t);
> }
>
> static __inline__ u_int64_t
> start_clock() {
> // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel..."
> u_int32_t hi, lo;
> __asm__ __volatile__ (
> "CPUID\n\t"
> "RDTSC\n\t"
> "mov %%edx, %0\n\t"
> "mov %%eax, %1\n\t": "=r" (hi), "=r" (lo)::
> "%rax", "%rbx", "%rcx", "%rdx");
> return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
> }
>
> static __inline__ u_int64_t
> stop_clock() {
> // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel..."
> u_int32_t hi, lo;
> __asm__ __volatile__(
> "RDTSCP\n\t"
> "mov %%edx, %0\n\t"
> "mov %%eax, %1\n\t"
> "CPUID\n\t": "=r" (hi), "=r" (lo)::
> "%rax", "%rbx", "%rcx", "%rdx");
> return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
> }
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK