qemu-devel

Re: About hardfloat in ppc


From: BALATON Zoltan
Subject: Re: About hardfloat in ppc
Date: Fri, 1 May 2020 15:39:06 +0200 (CEST)
User-agent: Alpine 2.22 (BSF 395 2020-01-19)

On Fri, 1 May 2020, Alex Bennée wrote:
罗勇刚(Yonggang Luo) <address@hidden> writes:
On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <address@hidden> wrote:
On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
That's what I suggested: we keep a cache of float operations:

typedef struct FpRecord {
    uint8_t op;
    float32 A;
    float32 B;
} FpRecord;

FpRecord fp_cache[1024];
int fp_cache_length;
uint32_t fp_exceptions;

1. For each new fp operation, push it onto fp_cache.
2. When fp_exceptions is read, recompute it by re-running the recorded
FpRecord sequence, then clear fp_cache_length.

Why do you need to store more than the last fp op? The cumulative bits can
be tracked as is done for other targets, by not clearing fp_status, so you
can read them from there. Only the non-sticky FI bit needs to be computed,
but that is determined solely by the last op, so it's enough to remember
that op and re-run it with softfloat (or even hardfloat after clearing the
status, but softfloat may be faster for this) to get the per-op bits when
the status is read.

Yes, storing only the last fp op is also an option. Do you mean storing
the last fp op and computing it only when necessary? I am thinking about
a general fp optimization method that suits all targets.

I think that's getting a little ahead of yourself. Let's prove the
technique is valuable for PPC (given it has the most to gain). We can
always generalise later if it's worthwhile.

Rather than creating a new structure, I would suggest creating 3 new TCG
globals (op, inA, inB) and re-factoring the front-end code so that each
FP op loads the TCG globals.
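Concretely, that could mirror how cpu_fpscr is already created. A sketch only: the fp_op/fp_inA/fp_inB names are made up here, and the corresponding CPUPPCState fields do not exist yet and would have to be added:

```c
/* Hypothetical new CPUPPCState fields (placeholder names):
 *   uint32_t fp_op;            opcode of the last FP operation
 *   uint64_t fp_inA, fp_inB;   its raw input operands
 */
static TCGv_i32 cpu_fp_op;
static TCGv_i64 cpu_fp_inA, cpu_fp_inB;

cpu_fp_op  = tcg_global_mem_new_i32(cpu_env,
                                    offsetof(CPUPPCState, fp_op), "fp_op");
cpu_fp_inA = tcg_global_mem_new_i64(cpu_env,
                                    offsetof(CPUPPCState, fp_inA), "fp_inA");
cpu_fp_inB = tcg_global_mem_new_i64(cpu_env,
                                    offsetof(CPUPPCState, fp_inB), "fp_inB");
```

Each FP op's translation would then store its opcode and operands into these globals instead of (or alongside) calling helper_reset_fpstatus().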

So wherever we see helper_reset_fpstatus() in target/ppc, we would need to replace it with saving the op and args to these globals, or just repurpose that helper to do so? It is called before every fp op but not before the sub-ops within vector ops. Is that correct? Probably it is, as a vector op is a single op, but then how do we detect flag changes caused by the sub-ops? I think these might have some existing bugs.

The TCG optimizer should pick up aliased loads
and automatically eliminate the dead ones. We might need some new
machinery for the TCG to avoid spilling the values over potentially
faulting loads/stores but that is likely a phase 2 problem.

I have no idea how to do this or even where to look. Some more detailed explanation may be needed here.

Next you will want to find places that care about the per-op bits of
cpu_fpscr and call a helper with the new globals to re-run the
computation and feed the values in.

So the code that cares about these bits is in the guest, thus we would need to compute them when we detect the guest accessing them. Detecting when the individual bits are accessed might be difficult, so at first we could just check whether the fpscr is read and recompute the FI bit before returning its value. You previously said this happens when the fpscr is read or when exceptions are generated, but I'm not sure where exactly these are done for ppc. (I'd expect an mffpscr insn, but there seem to be various other ops accessing parts of the fpscr, found around target/ppc/fp-impl.inc.c:567, so this would need studying the PPC docs to understand how the guest can access the FI bit of the fpscr reg.)

That would give you a reasonable working prototype to start doing some
measurements of overhead and if it makes a difference.



3. If fp_exceptions is cleared, set fp_cache_length to 0 and clear
fp_exceptions.
4. If fp_cache is full, recompute fp_exceptions by re-running the
recorded FpRecord sequence.

All this cache management and keeping more than one element seems
unnecessary to me, although I may be missing something.

Now the key point is how to track the reads and writes of the FPSCR
register. The current code is:
   cpu_fpscr = tcg_global_mem_new(cpu_env,
                                  offsetof(CPUPPCState, fpscr), "fpscr");

Maybe you could search for where the value is read, which should show the
places where we need to handle it, but changes may be needed to create a
clear API for this between target/ppc, TCG and softfloat, which likely
does not exist yet.

Once the per-op calculation is fixed in the PPC front-end, I think the
only change needed is to remove the #if defined(TARGET_PPC) in
softfloat.c - it's only really there because it avoids the overhead of
checking flags which we always know to be clear in its case.

That's the theory, but I've found that removing that define currently makes general fp ops slower and vector ops faster, so I think there may be some bugs that need to be found and fixed first. Testing with a proper FP test suite might be needed.

Regards,
BALATON Zoltan
