Re: Prefetches in buffer_zero_*
From: Dr. David Alan Gilbert
Subject: Re: Prefetches in buffer_zero_*
Date: Mon, 26 Jul 2021 13:07:50 +0100
User-agent: Mutt/2.0.7 (2021-05-04)
* Philippe Mathieu-Daudé (philmd@redhat.com) wrote:
> +Lukáš
>
> On 7/26/21 10:47 AM, Dr. David Alan Gilbert wrote:
> > * Joe Mario (jmario@redhat.com) wrote:
> >> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert
> >> <dgilbert@redhat.com>
> >> wrote:
> >>
> >>> * Richard Henderson (richard.henderson@linaro.org) wrote:
> >>>> On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> >>>>> Hi Richard,
> >>>>> I think you were the last person to fiddle with the prefetching
> >>>>> in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> >>>>> prefetching still made sense on modern CPUs, and that their hardware
> >>>>> generally figures stuff out better on simple increments.
> >>>>>
> >>>>> What was your thinking on this, and did you actually measure
> >>>>> any improvement?
> >>>>
> >>>> Ah, well, that was 5 years ago so I have no particular memory of this.
> >>>> It wouldn't surprise me if you can't measure any improvement on modern
> >>>> hardware.
> >>>>
> >>>> Do you now measure an improvement with the prefetches gone?
> >>>
> >>> Not tried, it just came from Joe's suggestion that it was generally a
> >>> bad idea these days; I do remember that the behaviour of those functions
> >>> is quite tricky because their performance is VERY data dependent - many
> >>> VMs actually have pages that are quite dirty so you never iterate the
> >>> loop, but then you hit others with big zero pages and you spend your
> >>> entire life in the loop.
> >>>
> >>>
> >> Dave, Richard:
> >> My curiosity got the best of me. So I created a small test program that
> >> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
> >
> > Thanks for testing,
> >
> >> When I run it on an Intel Cascade Lake processor, the cost of calling
> >> "__builtin_prefetch(p)" is in the noise range. It's always "just
> >> slightly" slower. I doubt it could ever be measured in qemu.
> >>
> >> Ironically, when I disabled the hardware prefetchers, the program slowed
> >> down over 33%. And the call to "__builtin_prefetch(p)" actually hurt
> >> performance by over 3%.
> >
> > Yeh that's a bit odd.
> >
> >> My results are below, (only with the hardware prefetchers enabled). The
> >> program is attached.
> >> Joe
> >>
> >> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> >> TSC 356144 Kcycles.
> >> TSC 356714 Kcycles.
> >> TSC 356707 Kcycles.
> >> TSC 356565 Kcycles.
> >> TSC 356853 Kcycles.
> >> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> >> TSC 355520 Kcycles.
> >> TSC 355961 Kcycles.
> >> TSC 355872 Kcycles.
> >> TSC 355948 Kcycles.
> >> TSC 355918 Kcycles.
> >
> > This basically agrees with the machines I've just tried your test on -
> > *except* AMD EPYC 7302P's - that really like the prefetch:
> >
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> > TSC 322162 Kcycles.
> > TSC 321861 Kcycles.
> > TSC 322212 Kcycles.
> > TSC 321957 Kcycles.
> > TSC 322085 Kcycles.
> >
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> > TSC 377988 Kcycles.
> > TSC 380125 Kcycles.
> > TSC 379440 Kcycles.
> > TSC 379689 Kcycles.
> > TSC 379571 Kcycles.
> >
> > The 1st gen doesn't seem to see much difference with/without it.
> >
> > Probably best to leave this code as is!
>
> Regardless of the decision to change the code or not, it would be
> nice to have this test committed in the repository to run
> performance regression testing from time to time.
It could be, although this is a slightly odd microtest for that; it's a bit
specific (the avx2 variant, and only really testing the all zero case).
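
A less specific test would probably want to cover both ends of the data
dependence mentioned earlier in the thread: the all-zero buffer where the
loop runs to the end, and the dirty buffer where it exits straight away.
Below is a rough sketch of that idea only. It is not Joe's program and not
the QEMU code: buffer_is_zero_sketch() and time_checks_ns() are
hypothetical names, the check is a plain scalar loop standing in for the
avx2 variant, and timing with clock_gettime(CLOCK_MONOTONIC) rather than
the TSC is an assumption. DO_PREFETCH is the same compile-time switch used
in the commands above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical scalar stand-in for buffer_zero_avx2(); NOT the QEMU code.
 * Checks 64 bytes per step, optionally prefetching the next 64 bytes.
 */
static bool buffer_is_zero_sketch(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

    for (; i + 64 <= len; i += 64) {
        uint64_t w[8];
#ifdef DO_PREFETCH
        /* A prefetch never faults, even at the end of the buffer. */
        __builtin_prefetch(p + i + 64);
#endif
        memcpy(w, p + i, sizeof(w));   /* alignment-safe 64-byte load */
        if (w[0] | w[1] | w[2] | w[3] | w[4] | w[5] | w[6] | w[7]) {
            return false;              /* early exit on a dirty block */
        }
    }
    for (; i < len; i++) {             /* tail shorter than 64 bytes */
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/*
 * Hypothetical timing helper: runs the check 'iters' times and returns
 * the elapsed wall-clock nanoseconds. *zero_hits receives the number of
 * calls that reported an all-zero buffer.
 */
static uint64_t time_checks_ns(const void *buf, size_t len, int iters,
                               int *zero_hits)
{
    struct timespec t0, t1;
    int hits = 0;

    assert(zero_hits != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int n = 0; n < iters; n++) {
        /* Compiler barrier so the check is not hoisted out of the loop. */
        __asm__ volatile("" : : "r"(buf) : "memory");
        hits += buffer_is_zero_sketch(buf, len);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *zero_hits = hits;
    return (uint64_t)((int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                      + (t1.tv_nsec - t0.tv_nsec));
}
```

Calling time_checks_ns() once on an all-zero buffer and once on a buffer
whose first byte is non-zero, built with and without -DDO_PREFETCH, would
give four numbers to compare per machine; the buffer size and iteration
count are left to whoever runs it.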
Dave
> >> /*
> >> * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
> >> *
> >> * Compile with either:
> >> * gcc -mavx buffer_zero_avx.c -O
> >> * or
> >> * gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH
> >> */
> >>
> [...]
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK