
Re: TCG performance on PPC64


From: Matheus K. Ferst
Subject: Re: TCG performance on PPC64
Date: Thu, 26 May 2022 08:07:07 -0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.7.0

On 19/05/2022 01:13, David Gibson wrote:
>> What would be different in aarch64 emulation that yields a better
>> performance on our POWER9?
>> - I suppose that aarch64 has more instructions with GVec implementations
>> than PPC64 and s390x, so maybe aarch64 guests can better use host-vector
>> instructions?
>
> As with Richard, I think it's pretty unlikely that this would make
> such a difference.  With a pure number crunching vector workload in
> the guest, maybe, with kernel & userspace boot, not really.  It might
> be interesting to configure a guest CPU without vector support to
> double check if it makes any difference though.
>
>>  - Looking at the flame graphs of each test (attached), I can see that
>> tb_gen_code takes proportionally less time of aarch64 emulation than PPC64
>> and s390x, so it might be that decodetree is faster?
>> - There is more than TCG at play, so perhaps the differences can be better
>> explained by VirtIO performance or something else?
>
> Also seems unlikely to me; I don't really see how this would differ
> enough based on guest type to make the difference we see here.
>
>> Currently, Leandro Lupori is working to improve TLB invalidation[7],
>> Victor Colombo is working to enable hardfpu in some scenarios, and I'm
>> reviewing some older helpers that can use GVec or be easily implemented
>> inline. We're also planning to add some Power ISA v3.1 instructions to
>> the TCG backend, but it's probably better to test on hardware whether our
>> changes are doing any good, and we don't have access to a POWER10 yet.
>>
>> Are there any other known performance problems for TCG on PPC64 that we
>> should investigate?
>
> Known?  I don't think so.  The TCG code is pretty old and clunky
> though, so there could be all manner of problems lurking in there.
>
>
> A couple of thoughts:
>
>  * I wonder how much emulation of guest side synchronization
>    instructions might be a factor here.  That's one of the few things
>    I can think of where the matchup between host and guest models
>    might make a difference.

That's an interesting suggestion; we'll be looking into this. It seems similar to Nicholas Piggin's recent work, and there is probably more to be done in this area.

>  It might be interesting to try these
>    tests with single core guests.  Likewise it might be interesting to
>    get results with multi-core guests, but MTTCG explicitly disabled.
>

With 50 runs:

+---------+--------------------------------+
|         |              Host              |
| Options +---------------+----------------+
|         |     PPC64     |     x86_64     |
+---------+---------------+----------------+
| -smp 2  | 427.41 ± 7.89 |  350.89 ± 7.62 |
| -smp 1  | 574.01 ± 4.18 | 411.27 ± 17.14 |
| No MTTCG| 588.84 ± 8.50 | 445.30 ± 21.66 |
+---------+---------------+----------------+
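For anyone reproducing the "No MTTCG" row: QEMU's documented way to keep a multi-core guest on a single host TCG thread is the `thread=single` accelerator property. A sketch of such an invocation (the machine, CPU, and image options below are placeholders, not the exact command used for these runs):

```shell
# 'thread=single' disables MTTCG, so the 2 guest vCPUs are emulated
# round-robin on one host thread. Kernel/disk arguments are placeholders.
qemu-system-aarch64 \
    -machine virt -cpu cortex-a57 -smp 2 \
    -accel tcg,thread=single \
    -kernel Image -append "console=ttyAMA0" -nographic
```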

The gap with x86 widened in the two new cases, but I'm not sure I can draw anything from this result. Maybe it's just SMT vs. Hyper-Threading that benefited the POWER9 in the initial test, or the Xeon is better at boosting a single core when QEMU uses only one thread.
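The ± figures in the table above are mean ± sample standard deviation over the 50 runs; a small stdlib-only helper along these lines (hypothetical, not the script actually used) reduces raw timings to that form:

```python
# Reduce raw boot timings (seconds, one value per run) to the
# "mean ± sample standard deviation" form used in the table above.
from statistics import mean, stdev

def summarize(times):
    """Return (mean, sample standard deviation) of a list of timings."""
    return mean(times), stdev(times)

runs = [427.1, 430.2, 419.8, 435.5, 424.9]  # made-up example timings
m, s = summarize(runs)
print(f"{m:.2f} ± {s:.2f}")
```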

>  * It might also be interesting to get CPU time results as well as
>    elapsed time.  That might indicate whether qemu is doing more
>    actual work in the slow cases, or if it's blocking for some
>    non-obvious reason.

The results above and in my first email were wall clock time, but I also have user and system times on a GitHub wiki page: https://github.com/PPC64/qemu/wiki/TCG-Performance-on-PPC64
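For completeness, wall-clock and CPU time of a child process can be captured in one place with only the standard library; a sketch (the command below is a placeholder for the qemu-system-* invocation):

```python
# Measure wall-clock vs. CPU time consumed by a child process.
# os.times() exposes accumulated children_user/children_system times,
# which subprocess.run() includes once the child has been waited on.
import os
import subprocess
import time

cmd = ["true"]  # placeholder; replace with the actual qemu command line
start = time.perf_counter()
before = os.times()
subprocess.run(cmd, check=True)
after = os.times()
wall = time.perf_counter() - start
user = after.children_user - before.children_user
sys_ = after.children_system - before.children_system
print(f"wall={wall:.2f}s user={user:.2f}s sys={sys_:.2f}s")
```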

Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO <http://www.eldorado.org.br/>
Analista de Software
Aviso Legal - Disclaimer <https://www.eldorado.org.br/disclaimer.html>
