[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] bitops.h: Compile out asserts without --enable-debug
|
From: |
Alex Bennée |
|
Subject: |
Re: [PATCH] bitops.h: Compile out asserts without --enable-debug |
|
Date: |
Tue, 23 May 2023 07:43:41 +0100 |
|
User-agent: |
mu4e 1.11.6; emacs 29.0.91 |
Richard Henderson <richard.henderson@linaro.org> writes:
> On 5/22/23 15:26, BALATON Zoltan wrote:
>> On Mon, 22 May 2023, Alex Bennée wrote:
>>> (ajb: add Richard for his compiler-fu)
>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>> On Mon, 22 May 2023, Alex Bennée wrote:
>>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>>
>>>>>> The low level extract and deposit funtions provided by bitops.h are
>>>>>> used in performance critical places. It crept into target/ppc via
>>>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>>>>> where hardfloat is also disabled is doubly affected.
>>>>>
>>>>> Most of these asserts compile out to nothing if the compiler is able to
>>>>> verify the constants are in the range. For example examining
>>>>> the start of float64_add:
>>>>>
>>> <snip>
>>>>>
>>>>> I don't see any check and abort steps because all the shift and mask
>>>>> values are known at compile time. The softfloat compilation certainly
>>>>> does have some assert points though:
>>>>>
>>>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o |grep assert
>>>>> 136: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND
>>>>> g_assertion_mess[...]
>>>>> 138: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND __assert_fail
>>>>>
>>>>> but the references are for the ISRA segments so its tricky to know if
>>>>> they get used or are just there for LTO purposes.
>>>>>
>>>>> If there are hot-paths that show up the extract/deposit functions I
>>>>> suspect a better approach would be to implement _nocheck variants (or
>>>>> maybe _noassert?) and use them where required rather than turning off
>>>>> the assert checking for these utility functions.
>>>>
>>>> Just to clarify again, the asserts are still there when compiled with
>>>> --enable-debug. The patch only turns them off for optimised release
>>>> builds which I think makes sense if these asserts are to catch
>>>> programming errors.
>>>
>>> Well as Peter said the general policy is to keep asserts in but I
>>> appreciate this is a hotpath case.
>>>
>>>> I think I've also suggested adding noassert
>>>> versions of these but that wasn't a popular idea and it may also not
>>>> be easy to convert all places to use that like for example the
>>>> register fields related usage in target/ppc as that would also affect
>>>> other places.
>>>
>>> Is code generation or device emulation really on the hot-path. Generally
>>> a well predicted assert is in the noise for those operations.
>> They aren't in code generation but in helpers as you can also see in
>> the profile below and so they can be on hot path. Also I've noticed
>> that extract8 and extract16 just call extract32 after adding another
>> assert on their own in addition to the one in extract32 which is
>> double overhead for really no reason. I'd delete all these asserts
>> as the likelhood of bugs these could catch is very low anyway (how
>> often do you expect somebody to call these with out of bound values
>> that would not be obvious from the results otherwise?) but leaving
>> them in non-debug builds is totally useless in my opinion.
>>
>>>> So this seems to be the simplest and most effective
>>>> approach.
>>>>
>>>> The softfloat related usage in these tests I've done seem to mostly
>>>> come from unpacking and repacking floats in softfloat which is done
>>>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
>>>> unpacks and 1 pack for each call and each unpack is 3 extracts so even
>>>> small overheads add app quickly. Just 1 muladd will result in 9
>>>> extracts and 2 deposits at least plus updating PPC flags for each FPU
>>>> op adds a bunch more. I did some profiling with perf to find these.
>>>
>>> After some messing about trying to get lame to cross compile to a static
>>> binary I was able to replicate what you've seen:
>>>
>>> 11.44% qemu-ppc64 qemu-ppc64 [.] unpack_raw64.isra.0
>>> 11.03% qemu-ppc64 qemu-ppc64 [.] parts64_uncanon_normal
>>> 8.26% qemu-ppc64 qemu-ppc64 [.]
>>> helper_compute_fprf_float64
>>> 6.75% qemu-ppc64 qemu-ppc64 [.] do_float_check_status
>>> 5.34% qemu-ppc64 qemu-ppc64 [.] parts64_muladd
>>> 4.75% qemu-ppc64 qemu-ppc64 [.] pack_raw64.isra.0
>>> 4.38% qemu-ppc64 qemu-ppc64 [.] parts64_canonicalize
>>> 3.62% qemu-ppc64 qemu-ppc64 [.]
>>> float64r32_round_pack_canonical
>>> 3.32% qemu-ppc64 qemu-ppc64 [.] helper_todouble
>>> 2.68% qemu-ppc64 qemu-ppc64 [.] float64_add
>>> 2.51% qemu-ppc64 qemu-ppc64 [.] float64_hs_compare
>>> 2.30% qemu-ppc64 qemu-ppc64 [.] float64r32_muladd
>>> 1.80% qemu-ppc64 qemu-ppc64 [.] float64r32_mul
>>> 1.40% qemu-ppc64 qemu-ppc64 [.] float64r32_add
>>> 1.34% qemu-ppc64 qemu-ppc64 [.] parts64_mul
>>> 1.16% qemu-ppc64 qemu-ppc64 [.] parts64_addsub
>>> 1.14% qemu-ppc64 qemu-ppc64 [.] helper_reset_fpstatus
>>> 1.06% qemu-ppc64 qemu-ppc64 [.] helper_float_check_status
>>> 1.04% qemu-ppc64 qemu-ppc64 [.] float64_muladd
>> I've run 32 bit PPC version in qemu-system-ppc so the profile is a
>> bit different (has more system related overhead that I plan to look
>> at separately) but this part is similar to the above. I also wonder
>> what makes helper_compute_fprf_float64 a bottleneck as that does not
>> seem to have much extract/deposit, only a call to clz but it's hard
>> to tell what it really does due to nested calls and macros. I've
>> also seen this function among the top contenders in my profiling.
>>
>>> what I find confusing is the fact the parts extraction and packing
>>> should all be known constants which should cause the asserts to
>>> disappear. However it looks like the compiler decided to bring in a copy
>>> of the whole inline function (ISRA = >interprocedural scalar replacement
>>> of aggregates) which obviously can't fold the constants and eliminate
>>> the assert.
>> Could it be related to that while the parts size and start are
>> marked const but pulled out of a struct field so the compiler may
>> not know their actual value until run time?
>>
>>> Richard,
>>>
>>> Any idea of why the compiler might decide to do something like this?
>
> Try this.
That seems to have done the trick, translated code is now dominating the
profile:
+ 14.12% 0.00% qemu-ppc64 [unknown] [.]
0x0000004000619420
+ 13.30% 0.00% qemu-ppc64 [unknown] [.]
0x0000004000616850
+ 12.58% 12.19% qemu-ppc64 qemu-ppc64 [.]
parts64_uncanon_normal
+ 10.62% 0.00% qemu-ppc64 [unknown] [.]
0x000000400061bf70
+ 9.91% 9.73% qemu-ppc64 qemu-ppc64 [.]
helper_compute_fprf_float64
+ 7.84% 7.82% qemu-ppc64 qemu-ppc64 [.]
do_float_check_status
+ 6.47% 5.78% qemu-ppc64 qemu-ppc64 [.]
parts64_canonicalize.constprop.0
+ 6.46% 0.00% qemu-ppc64 [unknown] [.]
0x0000004000620130
+ 6.42% 0.00% qemu-ppc64 [unknown] [.]
0x0000004000619400
+ 6.17% 6.04% qemu-ppc64 qemu-ppc64 [.] parts64_muladd
+ 5.85% 0.00% qemu-ppc64 [unknown] [.]
0x00000040006167e0
+ 5.74% 0.00% qemu-ppc64 [unknown] [.]
0x0000b693fcffffd3
+ 5.45% 4.78% qemu-ppc64 qemu-ppc64 [.]
float64r32_round_pack_canonical
+ 4.79% 0.00% qemu-ppc64 [unknown] [.]
0xfaff203940fe8240
+ 4.29% 3.82% qemu-ppc64 qemu-ppc64 [.] float64r32_muladd
+ 4.20% 3.87% qemu-ppc64 qemu-ppc64 [.] helper_todouble
+ 3.51% 2.98% qemu-ppc64 qemu-ppc64 [.] float64r32_mul
+ 3.12% 2.97% qemu-ppc64 qemu-ppc64 [.]
float64_hs_compare
+ 3.06% 2.95% qemu-ppc64 qemu-ppc64 [.] float64_add
+ 2.88% 2.35% qemu-ppc64 qemu-ppc64 [.] float64r32_add
+ 2.65% 0.00% qemu-ppc64 [unknown] [.]
0x6c555e9100004039
+ 2.55% 1.30% qemu-ppc64 qemu-ppc64 [.]
helper_float_check_status
eyeballing the float functions and I can't see any asserts expressed.
Before:
Executed in 721.65 secs fish external
usr time 721.38 secs 561.00 micros 721.38 secs
sys time 0.20 secs 261.00 micros 0.20 secs
After:
Executed in 650.38 secs fish external
usr time 650.11 secs 0.00 micros 650.11 secs
sys time 0.20 secs 989.00 micros 0.20 secs
>
>
> r~
>
> [2. text/x-patch; z.patch]...
--
Alex Bennée
Virtualisation Tech Lead @ Linaro