[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
Wed, 19 Feb 2020 16:35:07 +0100 (CET)
Alpine 2.22 (BSF 395 2020-01-19)
On Tue, 18 Feb 2020, Programmingkid wrote:
On Feb 18, 2020, at 12:10 PM, BALATON Zoltan <address@hidden> wrote:
While other targets take advantage of using host FPU to do floating
point computations, this was disabled for PPC target because always
clearing exception flags before every FP op made it slightly slower
than emulating everyting with softfloat. To emulate some FPSCR bits,
clearing of fp_status may be necessary (unless these could be handled
e.g. using FP exceptions on host but there's no API for that in QEMU
yet) but preserving at least the inexact flag makes hardfloat usable
and faster than softfloat. Since most clients don't actually care
about this flag, we can gain some speed trading some emulation
This patch implements a simple way to keep the inexact flag set for
hardfloat while still allowing to revert to softfloat for workloads
that need more accurate albeit slower emulation. (Set hardfloat
property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
still be room for further improvement but this seems to increase
floating point performance. Unfortunately the softfloat case is slower
than before this patch so this patch only makes sense if the default
is also set to enable hardfloat.
Because of the above this patch at the moment is mainly for testing
different workloads to evaluate how viable would this be in practice.
Thus, RFC and not ready for merge yet.
Signed-off-by: BALATON Zoltan <address@hidden>
v2: use different approach to avoid needing if () in
helper_reset_fpstatus() but this does not seem to change overhead
much, also make it a single patch as adding the hardfloat option is
only a few lines; with this we can use same value at other places where
float_status is reset and maybe enable hardfloat for a few more places
for a little more performance but not too much. With this I got:
Thank you for working on this. It is about time we have a better FPU.
Thank you for testing it. I think it would be great if we could come up
with some viable approach to improve this before the next freeze.
I applied your patch over David Gibson's ppc-for-5.0 branch. It applied cleanly
and compiled easily.
I've heard some preliminary results from others that there's also a
difference between v1 and v2 of the patch in performance where v1 may be
faster for same cases so if you (or someone else) want and have time you
could experiment with different versions and combinations as well to find
the one that's best on all CPUs. Basically we have these parts:
1. Change target/ppc/fpu_helper.c::helper_reset_fpstatus() to force
float_flag_inexact on in case hadfloat is enabled, I've tried two
approaches for this:
a. In v1 added an if () in the function
b. In v2 used a variable from env set earlier (I've hoped this may be
faster but maybe it's not, testing and explanation is welcome)
2. Also change places where env->fp_status is copied to a local tstat and
then reset (I think this is done to accumulate flags from multiple FP ops
that would individually reset env->fp_status or some other reason, maybe
this could be avoided if we reset fp_status less often but that would need
more understanding of the FP emulation that I don't have so I did not
try to clean that up yet).
If v2 is really slower than v1 then I'm not sure is it because also
changing places with tstat or because of the different approach in
helper_reset_fpstatus() so you could try combinations of these as well.
Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3.
What was the host CPU and OS this was tested on? Please always share CPU
info and host OS when sharing bechmark results so they are somewhat
comparable. It also depends on CPU features for vector instrucions at
least so without CPU info the results could not be understood.
I think G3 does not have AltiVec/VMX so maybe testing with G4 would be
better to also test those ops unless there's a reason to only test G3.
I've tested with G4 both FPU only and FPU+VMX code on Linux host with
i7-9700K CPU @ 3.60GHz as was noted in the original cover letter but may
be I'va also forgotten some details so I list it here again.
I did several tests and here are my results:
With hard float:
- The USB audio device does not produce any sound.
I've heard this could also be due to some other problem not directly
related to FPU, maybe there's a problem with USB/OHCI emulation as well
because problems with that were reported but it's interesting why you get
different results changing FPU related stuff. I think OSX uses float
samples so probably does use FPU for processing sound and may rely on some
pecularity of the hardware as it was probably optimised for Apple
hardware. It would be interesting to find out how FPU stuff is related to
this but since it's broken anyway probably not a show stopper at the
- Converting a MIDI file to AAC in iTunes happens at 0.4x (faster than soft
float :) ).
Does resulting file match? As a simple test I've verified md5sum of the
resulting mp3 with the lame benchmark I've tried just to find any big
errors. Even if it does not prove that nothing broke, it shuold detect if
something breaks badly. However that was WAV->MP3 where results were same,
although the VMX build did produce different result than FPU only but did
so consistently for multiple runs. With MIDI there might be slight timing
difference that could cause different audio results so you should first
verify if doing the conversion multiple times does produce the same result
at all without any patch first.
For my FPSCR test program, 21 tests failed. The high number is because
the inexact exception is being set for situations it should not be set
Since we force the inexact flag to be set to enable hardfloat this is
expected. More interesting is if apart from this are there any difference
in the results compared to the soffloat case (that may also be host CPU
dependent I think). Do you have more detailed info on the errors and
Some of the problems with inexact may be fixed by not always forcing the
flag on but just not clearing it. As I undersood other targets do that so
it starts with softfloat but the first time the inexact flag is set it
will start using hardfloat as long as the guest does not clear this flag.
Probably this is done to automatically detect code that needs the flag and
assume it's not used when it's not touched. Since PPC also has an inexact
flag just for previous FP op (the FI bit) apart from the usual cumulative
flag, the client could read that instead of clearing the cumulative flag
so we can't detect guest usage this way, teherefore we might as well break
inexact completely to always use hardfloat and need to manually enable it
for guests that we know need it. I'm not sure however if forcing the
inexact flag would lead to unwanted FP exceptions as well so this may also
need to be made conditional on the enabled/disabled status of inexact FP
exceptions. Does anyone have more info on this?
With soft float: - Some sound can be heard from the USB audio device. It
isn't good sounding. I had to force quit Quicktime player because it
- Converting a MIDI file to AAC in iTunes happens at 0.3x (slower than hard
- 13 tests failed with my FPSCR test program.
This patch is a good start. I'm not worried about the Floating Point
Status and Control Register flags being wrong since hardly any software
bothers to check them. I think more optimizations can happen by
I don't know if guest code checks fpscr and what flags it cares about.
Also don't know if it's a fact that these are not used but maybe if we
test with more guest codes we can find out. That's why I'd like to at
least have an option to test with hardfloat. Unfortunately enabling
hardfloat without also making it default would make it slower so if we go
this way we should make sure we can also enable hardfloat as default.
simplifying the FPU. As it is now it makes a lot of calls per operation.
Question is if those calls are really needed to emulate PPC FPU or if not
why would they be there? If the FPU is really that much different so all
these calls are needed then there's not much to simplify (but maybe there
could be some optimisations possible). This would need someone to
understand the current code in full first that probably we don't currently
(ar least I don't for sure so can't really make changes either). Another
more viable approach is to pick a small part and follow through with that
and try to clean up and optimise that small part only. The exception and
fpscr handling is one such part, another could be round_canonical() that
seems to be high on profiles I've taken. Maybe this could be done by
reading and understading docs only on the small part picked that may be
easier than getting everything first. I wonder if such smaller tasks could
be defined and given out as GSoC or other volunteer projects?
- [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, BALATON Zoltan, 2020/02/18
- Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, BALATON Zoltan, 2020/02/18
- Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, Programmingkid, 2020/02/18
- Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC,
BALATON Zoltan <=
- Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, Programmingkid, 2020/02/24
- Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, BALATON Zoltan, 2020/02/25
- Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, Programmingkid, 2020/02/26
- Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, BALATON Zoltan, 2020/02/26
- R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, luigi burdo, 2020/02/26
- R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, Dino Papararo, 2020/02/26
- Re: R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC, Alex Bennée, 2020/02/26