Re: Prefetches in buffer_zero_*


From: Dr. David Alan Gilbert
Subject: Re: Prefetches in buffer_zero_*
Date: Mon, 26 Jul 2021 09:47:43 +0100
User-agent: Mutt/2.0.7 (2021-05-04)

* Joe Mario (jmario@redhat.com) wrote:
> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> wrote:
> 
> > * Richard Henderson (richard.henderson@linaro.org) wrote:
> > > On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> > > > Hi Richard,
> > > >    I think you were the last person to fiddle with the prefetching
> > > > in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> > > > prefetching still made sense on modern CPUs, and that their hardware
> > > > generally figures stuff out better on simple increments.
> > > >
> > > >    What was your thinking on this, and did you actually measure
> > > > any improvement?
> > >
> > > Ah, well, that was 5 years ago so I have no particular memory of this.
> > It
> > > wouldn't surprise me if you can't measure any improvement on modern
> > > hardware.
> > >
> > > Do you now measure an improvement with the prefetches gone?
> >
> > Not tried, it just came from Joe's suggestion that it was generally a
> > bad idea these days; I do remember that the behaviour of those functions
> > is quite tricky because their performance is VERY data dependent - many
> > VMs actually have pages that are quite dirty so you never iterate the
> > loop, but then you hit others with big zero pages and you spend your
> > entire life in the loop.
> >
> >
> Dave, Richard:
> My curiosity got the best of me.  So I created a small test program that
> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.

Thanks for testing,

> When I run it on an Intel Cascade Lake processor, the cost of calling
> "__builtin_prefetch(p)" is in the noise range .  It's always "just
> slightly" slower.  I doubt it could ever be measured in qemu.
> 
> Ironically, when I disabled the hardware prefetchers, the program slowed
> down over 33%.  And the call to "__builtin_prefetch(p)" actually hurt
> performance by over 3%.

Yeh that's a bit odd.
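
(Aside: on recent Intel cores the hardware prefetchers Joe toggled are
normally controlled through MSR 0x1a4, MISC_FEATURE_CONTROL.  The snippet
below is a rough, untested sketch of flipping them via the msr driver; the
MSR layout and the /dev/cpu/N/msr paths are assumptions about the test box
rather than anything stated in this thread.  Run as root with the msr
module loaded.)

  /*
   * Sketch: write MSR 0x1a4 on every CPU to disable (0xf) or re-enable
   * (0x0) the hardware prefetchers.  Bit layout (1 = disabled), as
   * commonly documented for recent Intel cores:
   *   bit 0: L2 HW prefetcher        bit 1: L2 adjacent-line prefetcher
   *   bit 2: DCU (L1D) prefetcher    bit 3: DCU IP prefetcher
   */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      uint64_t val = (argc > 1) ? strtoull(argv[1], NULL, 0) : 0xf;
      char path[64];

      for (int cpu = 0; ; cpu++) {
          snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
          int fd = open(path, O_WRONLY);
          if (fd < 0) {
              if (cpu == 0) {
                  perror(path);       /* msr module missing, or not root */
              }
              break;                  /* ran out of CPUs */
          }
          if (pwrite(fd, &val, sizeof(val), 0x1a4) != sizeof(val)) {
              perror("pwrite MSR 0x1a4");
          }
          close(fd);
      }
      return 0;
  }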

> My results are below, (only with the hardware prefetchers enabled).  The
> program is attached.
> Joe
> 
> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> TSC 356144 Kcycles.
> TSC 356714 Kcycles.
> TSC 356707 Kcycles.
> TSC 356565 Kcycles.
> TSC 356853 Kcycles.
> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> TSC 355520 Kcycles.
> TSC 355961 Kcycles.
> TSC 355872 Kcycles.
> TSC 355948 Kcycles.
> TSC 355918 Kcycles.

This basically agrees with the machines I've just tried your test on -
*except* AMD EPYC 7302P's, which really like the prefetch:

[root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
TSC 322162 Kcycles.
TSC 321861 Kcycles. 
TSC 322212 Kcycles. 
TSC 321957 Kcycles.
TSC 322085 Kcycles. 
 
[root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
TSC 377988 Kcycles. 
TSC 380125 Kcycles. 
TSC 379440 Kcycles.
TSC 379689 Kcycles. 
TSC 379571 Kcycles. 
 
The 1st gen EPYC doesn't seem to see much difference with/without it.

Probably best to leave this code as is!
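
For what it's worth, the data dependence mentioned further up (a dirtied
page bails out of the loop almost immediately, an all-zero buffer runs it
to the end) is easy to see with a small variation on Joe's program.  A
rough, untested sketch - it reuses buffer_zero_avx2(), start_clock() and
stop_clock() from the attached buffer_zero_avx.c, and the time_one()
helper is just made up for illustration:

  /*
   * Sketch: drop into buffer_zero_avx.c in place of main() to compare
   * buffer_zero_avx2() on an all-zero buffer vs. one dirtied near the
   * front.  The zero case scans all 2GB; the dirty case returns after a
   * handful of iterations.
   */
  static void time_one(const char *label, const void *buf, size_t len)
  {
      u_int64_t start = start_clock();
      int is_zero = buffer_zero_avx2(buf, len);
      u_int64_t diff = stop_clock() - start;
      printf("%-9s zero=%d  %lu Kcycles\n", label, is_zero, diff / 1000);
  }

  int main(void)
  {
      size_t len = 2UL * 1024 * 1024 * 1024;
      char *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0L);

      if (ptr == MAP_FAILED) {
          perror("mmap");
          exit(1);
      }
      memset(ptr, 0, len);               /* touch the pages */

      time_one("all-zero:", ptr, len);   /* spends its life in the loop */
      ptr[4096] = 1;                     /* one nonzero byte, second page */
      time_one("dirty:", ptr, len);      /* exits almost immediately */
      return 0;
  }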

Dave


> Dave
> > >
> > > r~
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >

> /*
>  * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
>  *
>  * Compile with either:
>  *  gcc -mavx buffer_zero_avx.c -O 
>  * or
>  *  gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH 
>  */
> 
> #include <immintrin.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <stdlib.h>     /* exit() */
> #include <stddef.h>
> #include <sys/types.h>  /* u_int64_t */
> #include <sys/mman.h>
> #include <string.h>
> 
> #define likely(x)       __builtin_expect((x),1)
> #define unlikely(x)     __builtin_expect((x),0)
> 
> static __inline__ u_int64_t start_clock();
> static __inline__ u_int64_t stop_clock();
> static int buffer_zero_avx2(const void *buf, size_t len);
> 
> /*
>  * Allocate a large chunk of anon memory, touch/zero it,
>  * and then time the call to buffer_zero_avx2().
>  */
> int main() 
> {
>    size_t mmap_len = 2UL*1024*1024*1024;
>    char *ptr = mmap(NULL, mmap_len,
>        PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0L);
> 
>    if (ptr == MAP_FAILED) {
>        perror(" mmap");
>        exit(1);
>    }
> 
>    // Touch the pages (they're already cleared)
>    memset(ptr,0x0,mmap_len);
> 
>    u_int64_t start_rdtsc = start_clock();
> 
>    buffer_zero_avx2(ptr, mmap_len);
> 
>    u_int64_t stop_rdtsc = stop_clock();
>    u_int64_t diff = stop_rdtsc - start_rdtsc;
> 
>    printf("TSC %ld Kcycles. \n", diff/1000);
> 
> }
> 
> static int 
> buffer_zero_avx2(const void *buf, size_t len)
> {
>     /* Begin with an unaligned head of 32 bytes.  */
>     __m256i t = _mm256_loadu_si256(buf);
>     __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
>     __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
> 
>     if (likely(p <= e)) {
>         /* Loop over 32-byte aligned blocks of 128.  */
>         do {
> #ifdef DO_PREFETCH
>              __builtin_prefetch(p);
> #endif
>             if (unlikely(!_mm256_testz_si256(t, t))) {
>                 printf("In unlikely buffer_zero, p: %p\n", (void *)p);
>                 return 0;
>             }
>             t = p[-4] | p[-3] | p[-2] | p[-1];
>             p += 4;
>         } while (p <= e);
>     } else {
>         t |= _mm256_loadu_si256(buf + 32);
>         if (len <= 128) {
>             goto last2;
>         }
>     }
> 
>     /* Finish the last block of 128 unaligned.  */
>     t |= _mm256_loadu_si256(buf + len - 4 * 32);
>     t |= _mm256_loadu_si256(buf + len - 3 * 32);
> last2:
>     t |= _mm256_loadu_si256(buf + len - 2 * 32);
>     t |= _mm256_loadu_si256(buf + len - 1 * 32);
>   
>     // printf("End of buffer_zero_avx2\n");
>     return _mm256_testz_si256(t, t);
> }
> 
> static __inline__ u_int64_t 
> start_clock() {
>     // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel..."
>     u_int32_t hi, lo;
>     __asm__ __volatile__ (
>         "CPUID\n\t"
>         "RDTSC\n\t"
>         "mov %%edx, %0\n\t"
>         "mov %%eax, %1\n\t": "=r" (hi), "=r" (lo)::
>         "%rax", "%rbx", "%rcx", "%rdx");
>     return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
> }
> 
> static __inline__ u_int64_t 
> stop_clock() {
>     // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel..."
>     u_int32_t hi, lo;
>     __asm__ __volatile__(
>         "RDTSCP\n\t"
>         "mov %%edx, %0\n\t"
>         "mov %%eax, %1\n\t"
>         "CPUID\n\t": "=r" (hi), "=r" (lo)::
>         "%rax", "%rbx", "%rcx", "%rdx");
>     return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
> }
> 
> 

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



