[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

From: Programmingkid
Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
Date: Mon, 24 Feb 2020 22:07:05 -0500

> On Feb 19, 2020, at 10:35 AM, BALATON Zoltan <address@hidden> wrote:
> Hello,
> On Tue, 18 Feb 2020, Programmingkid wrote:
>>> On Feb 18, 2020, at 12:10 PM, BALATON Zoltan <address@hidden> wrote:
>>> While other targets take advantage of using host FPU to do floating
>>> point computations, this was disabled for PPC target because always
>>> clearing exception flags before every FP op made it slightly slower
>>> than emulating everyting with softfloat. To emulate some FPSCR bits,
>>> clearing of fp_status may be necessary (unless these could be handled
>>> e.g. using FP exceptions on host but there's no API for that in QEMU
>>> yet) but preserving at least the inexact flag makes hardfloat usable
>>> and faster than softfloat. Since most clients don't actually care
>>> about this flag, we can gain some speed trading some emulation
>>> accuracy.
>>> This patch implements a simple way to keep the inexact flag set for
>>> hardfloat while still allowing to revert to softfloat for workloads
>>> that need more accurate albeit slower emulation. (Set hardfloat
>>> property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
>>> still be room for further improvement but this seems to increase
>>> floating point performance. Unfortunately the softfloat case is slower
>>> than before this patch so this patch only makes sense if the default
>>> is also set to enable hardfloat.
>>> Because of the above this patch at the moment is mainly for testing
>>> different workloads to evaluate how viable would this be in practice.
>>> Thus, RFC and not ready for merge yet.
>>> Signed-off-by: BALATON Zoltan <address@hidden>
>>> ---
>>> v2: use different approach to avoid needing if () in
>>> helper_reset_fpstatus() but this does not seem to change overhead
>>> much, also make it a single patch as adding the hardfloat option is
>>> only a few lines; with this we can use same value at other places where
>>> float_status is reset and maybe enable hardfloat for a few more places
>>> for a little more performance but not too much. With this I got:
>> <snip>
>> Thank you for working on this. It is about time we have a better FPU.
> Thank you for testing it. I think it would be great if we could come up with 
> some viable approach to improve this before the next freeze.
>> I applied your patch over David Gibson's ppc-for-5.0 branch. It applied 
>> cleanly and compiled easily.
> I've heard some preliminary results from others that there's also a 
> difference between v1 and v2 of the patch in performance where v1 may be 
> faster for same cases so if you (or someone else) want and have time you 
> could experiment with different versions and combinations as well to find the 
> one that's best on all CPUs. Basically we have these parts:
> 1. Change target/ppc/fpu_helper.c::helper_reset_fpstatus() to force 
> float_flag_inexact on in case hadfloat is enabled, I've tried two approaches 
> for this:
> a. In v1 added an if () in the function
> b. In v2 used a variable from env set earlier (I've hoped this may be faster 
> but maybe it's not, testing and explanation is welcome)
> 2. Also change places where env->fp_status is copied to a local tstat and 
> then reset (I think this is done to accumulate flags from multiple FP ops 
> that would individually reset env->fp_status or some other reason, maybe this 
> could be avoided if we reset fp_status less often but that would need more 
> understanding of the FP emulation that I don't have so I did not try to clean 
> that up yet).
> If v2 is really slower than v1 then I'm not sure is it because also changing 
> places with tstat or because of the different approach in 
> helper_reset_fpstatus() so you could try combinations of these as well.
>> Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3.
> What was the host CPU and OS this was tested on? Please always share CPU info 
> and host OS when sharing bechmark results so they are somewhat comparable. It 
> also depends on CPU features for vector instrucions at least so without CPU 
> info the results could not be understood.

Intel Core i5-2500S CPU @ 2.70GHz.

> I think G3 does not have AltiVec/VMX so maybe testing with G4 would be better 
> to also test those ops unless there's a reason to only test G3. I've tested 
> with G4 both FPU only and FPU+VMX code on Linux host with i7-9700K CPU @ 
> 3.60GHz as was noted in the original cover letter but may be I'va also 
> forgotten some details so I list it here again.

Ok, I did test on the G4, here are my results:

Git commit: c1e667d2598b9b3ce62b8e89ed22dd38dfe9f57f
Mac OS 10.4.3 VM
-cpu G4
-USB audio device

Audio sounds bad when playing midi file.
Extraction rate: 1.5x
Converting rate: 0.7x
Total time: 7:24

Midi audio sounded perfect for about 30 seconds, then it went silent!
Extraction rate: 1.4x (slower with hard float)
Converting rate: 0.7x (same as without hardfloat)
Total time: 7:16 (faster time with hardfloat)

>> I did several tests and here are my results:
>> With hard float:
>> - The USB audio device does not produce any sound.
> I've heard this could also be due to some other problem not directly related 
> to FPU, maybe there's a problem with USB/OHCI emulation as well because 
> problems with that were reported but it's interesting why you get different 
> results changing FPU related stuff. I think OSX uses float samples so 
> probably does use FPU for processing sound and may rely on some pecularity of 
> the hardware as it was probably optimised for Apple hardware. It would be 
> interesting to find out how FPU stuff is related to this but since it's 
> broken anyway probably not a show stopper at the moment.

When I played sound this second time I hard the same broken audio I usually 
hear with the USB audio device with hardfloat set to false. When playing the 
same midi file with hardfloat set to true, the audio played perfectly! It only 
played for 30 seconds before it went silent.

>> - Converting a MIDI file to AAC in iTunes happens at 0.4x (faster than soft 
>> float :) ).
> Does resulting file match? As a simple test I've verified md5sum of the 
> resulting mp3 with the lame benchmark I've tried just to find any big errors. 
> Even if it does not prove that nothing broke, it shuold detect if something 
> breaks badly. However that was WAV->MP3 where results were same, although the 
> VMX build did produce different result than FPU only but did so consistently 
> for multiple runs. With MIDI there might be slight timing difference that 
> could cause different audio results so you should first verify if doing the 
> conversion multiple times does produce the same result at all without any 
> patch first.

The md5 number for each file does match. Note this time I converted a midi file 
to MP3.

>> For my FPSCR test program, 21 tests failed. The high number is because the 
>> inexact exception is being set for situations it should not be set for.
> Since we force the inexact flag to be set to enable hardfloat this is 
> expected. More interesting is if apart from this are there any difference in 
> the results compared to the soffloat case (that may also be host CPU 
> dependent I think). Do you have more detailed info on the errors and 
> differences found?

I can give you the full testing suite if you like. I run it on Mac OS 10.4 but 
it should compile with gcc on Linux. I will send it to you in a separate email 
because it is big.


>> With soft float: - Some sound can be heard from the USB audio device. It 
>> isn't good sounding. I had to force quit Quicktime player because it stopped 
>> working.
>> - Converting a MIDI file to AAC in iTunes happens at 0.3x (slower than hard 
>> float).
>> - 13 tests failed with my FPSCR test program.
>> This patch is a good start. I'm not worried about the Floating Point Status 
>> and Control Register flags being wrong since hardly any software bothers to 
>> check them. I think more optimizations can happen by
> I don't know if guest code checks fpscr and what flags it cares about. Also 
> don't know if it's a fact that these are not used but maybe if we test with 
> more guest codes we can find out. That's why I'd like to at least have an 
> option to test with hardfloat. Unfortunately enabling hardfloat without also 
> making it default would make it slower so if we go this way we should make 
> sure we can also enable hardfloat as default.
>> simplifying the FPU. As it is now it makes a lot of calls per operation.
> Question is if those calls are really needed to emulate PPC FPU or if not why 
> would they be there? If the FPU is really that much different so all these 
> calls are needed then there's not much to simplify (but maybe there could be 
> some optimisations possible). This would need someone to understand the 
> current code in full first that probably we don't currently (ar least I don't 
> for sure so can't really make changes either). Another more viable approach 
> is to pick a small part and follow through with that and try to clean up and 
> optimise that small part only. The exception and fpscr handling is one such 
> part, another could be round_canonical() that seems to be high on profiles 
> I've taken. Maybe this could be done by reading and understading docs only on 
> the small part picked that may be easier than getting everything first. I 
> wonder if such smaller tasks could be defined and given out as GSoC or other 
> volunteer projects?

I have another idea on how to improve QEMU's performance. What if you enabled 
more CPUs for the PowerPC target? Mac OS 9, Mac OS X, and Linux support 
multiple CPUs. It might actually be easier to do this than to improve the FPU. 
I imagine the performance increase with multiple emulated CPUs would be much 
more noticeable. 

reply via email to

[Prev in Thread] Current Thread [Next in Thread]