Re: Prefetches in buffer_zero_*


From: Dr. David Alan Gilbert
Subject: Re: Prefetches in buffer_zero_*
Date: Mon, 26 Jul 2021 13:07:50 +0100
User-agent: Mutt/2.0.7 (2021-05-04)

* Philippe Mathieu-Daudé (philmd@redhat.com) wrote:
> +Lukáš
> 
> On 7/26/21 10:47 AM, Dr. David Alan Gilbert wrote:
> > * Joe Mario (jmario@redhat.com) wrote:
> >> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>
> >>> * Richard Henderson (richard.henderson@linaro.org) wrote:
> >>>> On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> >>>>> Hi Richard,
> >>>>>    I think you were the last person to fiddle with the prefetching
> >>>>> in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> >>>>> prefetching still made sense on modern CPUs, and that their hardware
> >>>>> generally figures stuff out better on simple increments.
> >>>>>
> >>>>>    What was your thinking on this, and did you actually measure
> >>>>> any improvement?
> >>>>
> >>>> Ah, well, that was 5 years ago so I have no particular memory of this. It
> >>>> wouldn't surprise me if you can't measure any improvement on modern
> >>>> hardware.
> >>>>
> >>>> Do you now measure an improvement with the prefetches gone?
> >>>
> >>> Not tried, it just came from Joe's suggestion that it was generally a
> >>> bad idea these days; I do remember that the behaviour of those functions
> >>> is quite tricky because their performance is VERY data dependent - many
> >>> VMs actually have pages that are quite dirty so you never iterate the
> >>> loop, but then you hit others with big zero pages and you spend your
> >>> entire life in the loop.
> >>>
> >>>
> >> Dave, Richard:
> >> My curiosity got the best of me.  So I created a small test program that
> >> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
> > 
> > Thanks for testing,
> > 
> >> When I run it on an Intel Cascade Lake processor, the cost of calling
> >> "__builtin_prefetch(p)" is in the noise range.  It's always "just
> >> slightly" slower.  I doubt it could ever be measured in qemu.
> >>
> >> Ironically, when I disabled the hardware prefetchers, the program slowed
> >> down over 33%.  And the call to "__builtin_prefetch(p)" actually hurt
> >> performance by over 3%.
> > 
> > Yeh that's a bit odd.
> > 
> >> My results are below, (only with the hardware prefetchers enabled).  The
> >> program is attached.
> >> Joe
> >>
> >> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> >> TSC 356144 Kcycles.
> >> TSC 356714 Kcycles.
> >> TSC 356707 Kcycles.
> >> TSC 356565 Kcycles.
> >> TSC 356853 Kcycles.
> >> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> >> TSC 355520 Kcycles.
> >> TSC 355961 Kcycles.
> >> TSC 355872 Kcycles.
> >> TSC 355948 Kcycles.
> >> TSC 355918 Kcycles.
> > 
> > This basically agrees with the machines I've just tried your test on -
> > *except* AMD EPYC 7302P's - that really like the prefetch:
> > 
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> > TSC 322162 Kcycles.
> > TSC 321861 Kcycles. 
> > TSC 322212 Kcycles. 
> > TSC 321957 Kcycles.
> > TSC 322085 Kcycles. 
> >  
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> > TSC 377988 Kcycles. 
> > TSC 380125 Kcycles. 
> > TSC 379440 Kcycles.
> > TSC 379689 Kcycles. 
> > TSC 379571 Kcycles. 
> >  
> > The 1st gen doesn't seem to see much difference with/without it.
> > 
> > Probably best to leave this code as is!
> 
> Regardless of whether the code gets changed, it would be
> nice to have this test committed in the repository so that
> performance regression testing can be run from time to time.

It could be, although this is a slightly odd microtest for that; it's a bit
specific (the avx2 variant, and only really testing the all zero case).
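
A more general micro-test would probably want to cover more than one data
pattern (all zero, dirty at the start, dirty at the end) and toggle the
prefetch at run time. The following is only a rough sketch of that idea and
is not the code from bufferiszero.c or Joe's attached program: it uses a
plain scalar loop instead of AVX2, assumes a 64-byte cache line, relies on
GCC/Clang builtins, and every sketch_* name is hypothetical.

```c
/* Sketch only: portable scalar zero check with an optional prefetch.
 * Not QEMU's buffer_zero_avx2(); all sketch_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define SKETCH_CHUNK 64 /* assumed cache-line size in bytes */

/* Return true if buf[0..len) is all zero.
 * len must be a multiple of SKETCH_CHUNK. */
static bool sketch_buffer_is_zero(const void *buf, size_t len, bool prefetch)
{
    const unsigned char *p = buf;

    assert(len % SKETCH_CHUNK == 0);
    for (size_t off = 0; off < len; off += SKETCH_CHUNK) {
        uint64_t w[SKETCH_CHUNK / 8];
        uint64_t acc = 0;

        if (prefetch) {
            /* Hint for the next chunk; a prefetch does not fault, and the
             * address is at most one past the end of the buffer. */
            __builtin_prefetch(p + off + SKETCH_CHUNK);
        }
        memcpy(w, p + off, SKETCH_CHUNK);
        for (size_t i = 0; i < SKETCH_CHUNK / 8; i++) {
            acc |= w[i];
        }
        if (acc != 0) {
            return false; /* dirty data: leave the loop early */
        }
    }
    return true;
}

/* Zero the buffer, then dirty one byte at dirty_at.
 * Pass dirty_at >= len to keep the buffer all zero. */
static void sketch_fill(unsigned char *buf, size_t len, size_t dirty_at)
{
    memset(buf, 0, len);
    if (dirty_at < len) {
        buf[dirty_at] = 1;
    }
}

/* Run the check reps times and return the CPU time used, in seconds.
 * The result of the last check is stored in *all_zero. */
static double sketch_time_checks(const void *buf, size_t len, bool prefetch,
                                 int reps, bool *all_zero)
{
    bool r = true;
    clock_t start = clock();

    for (int i = 0; i < reps; i++) {
        r = sketch_buffer_is_zero(buf, len, prefetch);
        /* Compiler barrier so the repeated calls are not folded into one. */
        __asm__ volatile("" ::: "memory");
    }
    *all_zero = r;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling sketch_time_checks() on the same buffer with prefetch set to true and
then false, once per sketch_fill() pattern, would give a rough comparison for
each data pattern; the numbers will not be comparable with the TSC figures
quoted above.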


Dave

> >> /*
> >>  * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
> >>  *
> >>  * Compile with either:
> >>  *  gcc -mavx buffer_zero_avx.c -O 
> >>  * or
> >>  *  gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH 
> >>  */
> >>
> [...]
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



