qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination


From: Aurelien Jarno
Subject: Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
Date: Sun, 15 May 2011 11:24:49 +0200
User-agent: Mutt/1.5.20 (2009-06-14)

On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote:
> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <address@hidden> wrote:
> > On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
> >> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <address@hidden> wrote:
> >> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
> >> >> Here's a RFC series for eliminating AREG0.
> >> >>
> >> >> Blue Swirl (11):
> >> >>   Move user emulator stuff from cpu-exec.c to user-exec.c
> >> >>   Delete unused tb_invalidate_page_range
> >> >>
> >> >> The above should be OK to commit.
> >> >>
> >> >>   cpu_loop_exit: avoid using AREG0
> >> >>   Delegate setup of TCG temporaries to targets
> >> >>
> >> >> These two are not, unless the overall plan is OK.
> >> >>
> >> >>   TCG: fix negative frame offset calculations
> >> >>   TCG/x86: use stack for TCG temps
> >> >>   TCG/Sparc64: use stack for TCG temps
> >> >>
> >> >> But these three should be OK. I've tested lightly x86_64 and Sparc64 
> >> >> hosts.
> >> >>
> >> >>   Add CONFIG_TARGET_NEEDS_AREG0
> >> >>   Don't compile legacy qemu_ld/st functions if target doesn't need them
> >> >>
> >> >> Should be OK, though the latter patch only touches x86.
> >> >>
> >> >>   Add new qemu_ld and qemu_st functions
> >> >>   sparc: use new qemu_ld and qemu_st functions
> >> >>
> >> >> The last two compile but QEMU segfaults. I just made a naive
> >> >> conversion for getting comments.
> >> >>
> >> >
> >> > What is the goal behing removing TCG_AREG0? If it is speed improvement,
> >> > can you please provide some benchmarks?
> >>
> >> There was some discussion earlier about why this (or parts of the
> >> conversion) may be a speed improvement:
> >>
> >> http://article.gmane.org/gmane.comp.emulators.qemu/101826
> >> http://article.gmane.org/gmane.comp.emulators.qemu/102156
> >
> > Ok, looks like I have missed that.
> >
> >> There are no benchmarks yet. In fact, it may be difficult to make
> >> those without performing the removal completely.
> >>
> >> For example, patch 3/11 makes cpu_loop_exit take CPUState as a
> >> parameter instead of using global env, which would be available and
> >> the register is reserved anyway. This would only decrease performance
> >> at this stage, unless a complete conversion is done. I suspect the
> >> same would happen when moving all helpers from op_helper.c to
> >> helper.c. But after the whole conversion, this would be a neutral (no
> >> extra registers used) or even beneficial change (the code is free to
> >> use one more register).
> >>
> >> > The env register is used very often (basically for every load/store, but
> >> > also a lot of helpers), so it makes sense to reserve a register for it.
> >> >
> >> > For what I understand from your patch series, you prefer to pass this
> >> > register explicitly to TCG functions. This basically means this TCG
> >> > global will be loaded to host register as soon as it is used, but also
> >> > regularly, as globals are saved back to their canonical location before
> >> > an helper or a load/store.
> >> >
> >> > So it seems that this patch series will just allowing the "env register"
> >> > to change over time, though it will not spare one more register for the
> >> > TCG code, and it will emit longer TCG code to regularly reload the env
> >> > global into a host register.
> >>
> >> But there will be one more register available in some cases. In other
> >
> > Inside the TCG code, it will basically happens very rarely, given
> > load/store are really the most used instructions, and they need to load
> > the env register.
> 
> Not exactly, from a sample run with -d op_opt:
> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
> 1673966 movi_i32
>  653931 ld_i32

On which target is that? It looks like that some variables could be
mapped directly as a global to avoid so many load/stores.

Anyway, even this simple load instruction, needs a base register, and it
mosts cases this register is TCG_AREG0. I agree that it can be replaced by
another base register, but in anycase, a register has to be used for that.

It means after such an instruction is encountered, one register is
reserved for this global up to the next helper/memory load store, so
there isn't any register spared, but more instructions generated to
explicitly load this register. 

I also just realized all globals registered through tcg_global_mem_new()
needs a base register and currently it is TCG_AREG0, that means even a
mov_i32 might need access to TCG_AREG0. How do you plan to solve that 
part?

>  607432 mov_i32
>  428684 st_i32
>  326878 movi_i64
>  308626 add_i32
>  283186 call
>  256817 exit_tb
>  207232 nopn
>  189388 goto_tb
>  122398 and_i32
>  117997 shr_i32
>   89107 qemu_ld32
>   82926 set_label
>   82713 brcond_i32
>   67169 qemu_st32
>   55109 or_i32
>   46536 ext32u_i64
>   44288 xor_i32
>   38103 sub_i32
>   26361 shl_i32
>   23218 shl_i64
>   23218 qemu_st64
>   23218 or_i64
>   20474 shr_i64
>   20445 qemu_ld64
>   11161 qemu_ld8u
>   10409 qemu_st8
>    5013 qemu_ld16u
>    3795 qemu_st16
>    2776 qemu_ld8s
>    1915 sar_i32
>    1414 qemu_ld16s
>     839 not_i32
>     579 setcond_i32
>     213 br
>      42 ext32s_i64
>      30 mul_i64
> 
> But most other ops probably don't need any additional registers. It
> could still be that with the extra register, some values could be kept
> there instead of flushing to storage.
> 
> >> cases, the number of registers used does not change. Moving the
> >> registers around is what worries me too.
> >>
> >> But there are other effects too, the helpers are now compiled so that
> >> the global env register is not used. Especially on hosts with low
> >> number of registers this is not optimal.
> >
> > Most helpers are very small functions, so I am not sure more registers
> > will help.
> 
> $ nm --print-size --defined-only --size-sort --reverse-sort
> obj-amd64/sparc-softmmu/op_helper.o |head -20
> 0000000000003a70 0000000000000c75 T helper_st_asi
> 00000000000046f0 000000000000086d T helper_ld_asi
> 0000000000002550 000000000000041b T __stw_mmu
> 00000000000012c0 000000000000040c T __stq_mmu
> 0000000000001d10 00000000000003c5 T __stl_mmu
> 0000000000001a90 0000000000000277 T __ldq_mmu
> 00000000000022f0 000000000000025f T __ldl_mmu
> 0000000000002b50 000000000000024f T __ldw_mmu
> 0000000000001870 0000000000000219 t slow_ldq_mmu
> 00000000000020e0 0000000000000208 t slow_ldl_mmu
> 00000000000033f0 0000000000000202 T helper_stqf
> 0000000000002970 00000000000001de t slow_ldw_mmu
> 0000000000005120 00000000000001dd T helper_fcmpeq
> 0000000000003720 00000000000001c7 T helper_ldqf
> 0000000000001120 000000000000019e t slow_stb_mmu
> 00000000000016d0 000000000000019c T __stb_mmu
> 0000000000002da0 000000000000018a t slow_ldb_mmu
> 0000000000005410 0000000000000172 T helper_fcmped
> 0000000000002f30 000000000000016d T __ldb_mmu
> 0000000000000a70 0000000000000164 T do_unassigned_access
> 
> These are not so small, but they also don't look like frequently used
> ones, judging by the names. It would be more interesting to see the
> sizes of heavily used helpers.

Agreed. Also most of these helpers need access to env anyway, so it 
means one more argument that might be kept in a register in some parts 
of the helper. It might be a good idea to recompile this file without
reserving a register for the env pointer, and passing it as an argument
instead just to see the difference.

> >> > In any case at then end benchmarks is what are need to decided, TCG has
> >> > always shown that performance improvements doesn't match the improvement
> >> > analysis.
> >>
> >> If this turns out to be a bad idea, it means that the reverse
> >> conversion will be beneficial and we should convert helper.c code to
> >> op_helper.c and take advantage of the global env in a register. This
> >> was actually my standpoint when the discussion started, but I'm
> >> interested to see if this approach would work better.
> >
> > Does it mean that you plan to do the code changes in git without really
> > benchmarking, and revert all the changes later if it was a bad idea?
> 
> I don't think that that would be a good idea. There are actually
> several possible plans that should be considered, based on the
> performance:
> - don't do anything
> - full conversion
> - avoid AREG0 use outside of generated code: move helpers from
> op_helper.c to helper.c
> - avoid AREG0 in cpu-exec.c etc.
> - maximize AREG0 use: move helpers from helper.c etc. to op_helper.c
> 
> It's also possible to refactor qemu_ld/st independently to the above
> so that they still use AREG0.
> 
> The performance neutral changes (1, 2, 4 to 7) may have merits of
> their own, so they may be applied later if there are no objections.
> 

It seems the choices there don't match the different possible plans
above (5 vs 7). That said I am not really sure that we agree on what
is performance neutral or not.

I have no objection to remove uses of TCG_AREG0 in GCC generated code,
ie in the helpers, though I doubt it will make a real change on the
performances: slightly bigger intermediate code, one more register move
to call the helpers, on more free register for the helpers. That should
not make a huge difference, and I agree it makes the code cleaner.

On the other hand, I have objections to remove uses of TCG_AREG0 from
the TCG part. This register is part of the TCG design and thus used very
often. In any case we have to keep it for accesses to the globals, so we
can keep it for softmmu load/stores too. If we go for the removal of 
TCG_AREG0 from the GCC side, it probably means loading it in the prologue
instead.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
address@hidden                 http://www.aurel32.net



reply via email to

[Prev in Thread] Current Thread [Next in Thread]