
Re: Prefetches in buffer_zero_*


From: Philippe Mathieu-Daudé
Subject: Re: Prefetches in buffer_zero_*
Date: Mon, 26 Jul 2021 13:31:39 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

+Lukáš

On 7/26/21 10:47 AM, Dr. David Alan Gilbert wrote:
> * Joe Mario (jmario@redhat.com) wrote:
>> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
>> wrote:
>>
>>> * Richard Henderson (richard.henderson@linaro.org) wrote:
>>>> On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
>>>>> Hi Richard,
>>>>>    I think you were the last person to fiddle with the prefetching
>>>>> in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
>>>>> prefetching still made sense on modern CPUs, and that their hardware
>>>>> generally figures stuff out better on simple increments.
>>>>>
>>>>>    What was your thinking on this, and did you actually measure
>>>>> any improvement?
>>>>
>>>> Ah, well, that was 5 years ago so I have no particular memory of this.
>>>> It wouldn't surprise me if you can't measure any improvement on modern
>>>> hardware.
>>>>
>>>> Do you now measure an improvement with the prefetches gone?
>>>
>>> Not tried; it just came from Joe's suggestion that it was generally a
>>> bad idea these days. I do remember that the behaviour of those functions
>>> is quite tricky because their performance is VERY data dependent - many
>>> VMs actually have pages that are quite dirty, so you never iterate the
>>> loop, but then you hit others with big zero pages and you spend your
>>> entire life in the loop.
>>>
>>>
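The loop being discussed looks roughly like the sketch below - a hedged
reconstruction in the spirit of buffer_zero_avx2() from qemu's
bufferiszero.c, not the actual qemu code. The function name, the
128-byte stride, and the prefetch distance are all illustrative
assumptions.

/*
 * Hedged sketch of an AVX2 zero-check loop in the spirit of
 * buffer_zero_avx2() from qemu's bufferiszero.c; NOT the actual
 * qemu code.  The 128-byte stride and the prefetch distance are
 * assumptions.  Assumes buf is 32-byte aligned and len is a
 * multiple of 128.  Compile with: gcc -mavx2 [-DDO_PREFETCH]
 */
#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

bool buffer_is_zero_avx2_sketch(const void *buf, size_t len)
{
    const __m256i *p = buf;
    const __m256i *end = (const __m256i *)((const char *)buf + len);

    while (p < end) {
#ifdef DO_PREFETCH
        /* Prefetch one 128-byte block ahead; the distance is a guess. */
        __builtin_prefetch(p + 4);
#endif
        /* OR four 32-byte vectors together and test the result once.
         * On a dirty page this exits on the first iteration; on an
         * all-zero page it runs the full length of the buffer -
         * exactly the data-dependent behaviour described above.  */
        __m256i t = _mm256_or_si256(_mm256_or_si256(p[0], p[1]),
                                    _mm256_or_si256(p[2], p[3]));
        if (!_mm256_testz_si256(t, t)) {
            return false;
        }
        p += 4;
    }
    return true;
}

Compiled with -DDO_PREFETCH, the sketch issues one software prefetch
per 128-byte block, matching the toggle used in the measurements below.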
>> Dave, Richard:
>> My curiosity got the best of me.  So I created a small test program that
>> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
> 
> Thanks for testing,
> 
>> When I run it on an Intel Cascade Lake processor, the cost of calling
>> "__builtin_prefetch(p)" is in the noise range.  It's always "just
>> slightly" slower.  I doubt it could ever be measured in qemu.
>>
>> Ironically, when I disabled the hardware prefetchers, the program slowed
>> down by over 33%.  And the call to "__builtin_prefetch(p)" actually hurt
>> performance by over 3%.
> 
> Yeah, that's a bit odd.
> 
>> My results are below, (only with the hardware prefetchers enabled).  The
>> program is attached.
>> Joe
>>
>> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
>> TSC 356144 Kcycles.
>> TSC 356714 Kcycles.
>> TSC 356707 Kcycles.
>> TSC 356565 Kcycles.
>> TSC 356853 Kcycles.
>> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
>> TSC 355520 Kcycles.
>> TSC 355961 Kcycles.
>> TSC 355872 Kcycles.
>> TSC 355948 Kcycles.
>> TSC 355918 Kcycles.
> 
> This basically agrees with the machines I've just tried your test on -
> *except* AMD EPYC 7302Ps, which really like the prefetch:
> 
> [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> TSC 322162 Kcycles.
> TSC 321861 Kcycles. 
> TSC 322212 Kcycles. 
> TSC 321957 Kcycles.
> TSC 322085 Kcycles. 
>  
> [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> TSC 377988 Kcycles. 
> TSC 380125 Kcycles. 
> TSC 379440 Kcycles.
> TSC 379689 Kcycles. 
> TSC 379571 Kcycles. 
>  
> The 1st-gen EPYC doesn't seem to see much difference with/without it.
> 
> Probably best to leave this code as is!

Regardless of whether we decide to change the code, it would be
nice to have this test committed to the repository so we can run
performance regression tests from time to time.

>> /*
>>  * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
>>  *
>>  * Compile with either:
>>  *  gcc -mavx buffer_zero_avx.c -O 
>>  * or
>>  *  gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH 
>>  */
>>
[...]
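Since the attached program is elided here, the following is a hedged
sketch of what such a TSC-based harness might look like, paired with
the zero-check sketch earlier on this page; the buffer size, iteration
count, and all names are assumptions, not Joe's original.

/*
 * Hedged sketch of a TSC benchmark harness; NOT Joe's attached
 * program.  Buffer size and iteration count are assumptions.
 * Link together with the zero-check sketch above, e.g.:
 *   gcc -mavx2 -O sketch.c harness.c [-DDO_PREFETCH]
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>              /* __rdtsc() */

/* Zero-check sketch from earlier on this page.  */
extern bool buffer_is_zero_avx2_sketch(const void *buf, size_t len);

#define BUF_SIZE   4096             /* one x86 page - an assumption */
#define ITERATIONS (1000 * 1000)

int main(void)
{
    /* All-zero buffer: the worst case, where the check runs the
     * full loop instead of bailing out early.  */
    void *buf = aligned_alloc(32, BUF_SIZE);
    memset(buf, 0, BUF_SIZE);

    unsigned long long start = __rdtsc();
    for (long i = 0; i < ITERATIONS; i++) {
        if (!buffer_is_zero_avx2_sketch(buf, BUF_SIZE)) {
            abort();                /* buffer is all zero; must not happen */
        }
    }
    unsigned long long stop = __rdtsc();

    printf("TSC %llu Kcycles.\n", (stop - start) / 1000);
    free(buf);
    return 0;
}

Timing with __rdtsc() over a large iteration count amortizes per-read
noise; a production harness would likely also pin the process to one
CPU, which is omitted here for brevity.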



