qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination


From: Blue Swirl
Subject: Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination
Date: Sun, 15 May 2011 14:12:27 +0300

On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <address@hidden> wrote:
> On Sun, May 15, 2011 at 12:58:49PM +0300, Blue Swirl wrote:
>> On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno <address@hidden> wrote:
>> > On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote:
>> >> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <address@hidden> wrote:
>> >> > On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote:
>> >> >> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <address@hidden> 
>> >> >> wrote:
>> >> >> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote:
>> >> >> >> Here's a RFC series for eliminating AREG0.
>> >> >> >>
>> >> >> >> Blue Swirl (11):
>> >> >> >>   Move user emulator stuff from cpu-exec.c to user-exec.c
>> >> >> >>   Delete unused tb_invalidate_page_range
>> >> >> >>
>> >> >> >> The above should be OK to commit.
>> >> >> >>
>> >> >> >>   cpu_loop_exit: avoid using AREG0
>> >> >> >>   Delegate setup of TCG temporaries to targets
>> >> >> >>
>> >> >> >> These two are not, unless the overall plan is OK.
>> >> >> >>
>> >> >> >>   TCG: fix negative frame offset calculations
>> >> >> >>   TCG/x86: use stack for TCG temps
>> >> >> >>   TCG/Sparc64: use stack for TCG temps
>> >> >> >>
>> >> >> >> But these three should be OK. I've tested lightly x86_64 and 
>> >> >> >> Sparc64 hosts.
>> >> >> >>
>> >> >> >>   Add CONFIG_TARGET_NEEDS_AREG0
>> >> >> >>   Don't compile legacy qemu_ld/st functions if target doesn't need 
>> >> >> >> them
>> >> >> >>
>> >> >> >> Should be OK, though the latter patch only touches x86.
>> >> >> >>
>> >> >> >>   Add new qemu_ld and qemu_st functions
>> >> >> >>   sparc: use new qemu_ld and qemu_st functions
>> >> >> >>
>> >> >> >> The last two compile but QEMU segfaults. I just made a naive
>> >> >> >> conversion for getting comments.
>> >> >> >>
>> >> >> >
>> >> >> > What is the goal behing removing TCG_AREG0? If it is speed 
>> >> >> > improvement,
>> >> >> > can you please provide some benchmarks?
>> >> >>
>> >> >> There was some discussion earlier about why this (or parts of the
>> >> >> conversion) may be a speed improvement:
>> >> >>
>> >> >> http://article.gmane.org/gmane.comp.emulators.qemu/101826
>> >> >> http://article.gmane.org/gmane.comp.emulators.qemu/102156
>> >> >
>> >> > Ok, looks like I have missed that.
>> >> >
>> >> >> There are no benchmarks yet. In fact, it may be difficult to make
>> >> >> those without performing the removal completely.
>> >> >>
>> >> >> For example, patch 3/11 makes cpu_loop_exit take CPUState as a
>> >> >> parameter instead of using global env, which would be available and
>> >> >> the register is reserved anyway. This would only decrease performance
>> >> >> at this stage, unless a complete conversion is done. I suspect the
>> >> >> same would happen when moving all helpers from op_helper.c to
>> >> >> helper.c. But after the whole conversion, this would be a neutral (no
>> >> >> extra registers used) or even beneficial change (the code is free to
>> >> >> use one more register).
>> >> >>
>> >> >> > The env register is used very often (basically for every load/store, 
>> >> >> > but
>> >> >> > also a lot of helpers), so it makes sense to reserve a register for 
>> >> >> > it.
>> >> >> >
>> >> >> > For what I understand from your patch series, you prefer to pass this
>> >> >> > register explicitly to TCG functions. This basically means this TCG
>> >> >> > global will be loaded to host register as soon as it is used, but 
>> >> >> > also
>> >> >> > regularly, as globals are saved back to their canonical location 
>> >> >> > before
>> >> >> > an helper or a load/store.
>> >> >> >
>> >> >> > So it seems that this patch series will just allowing the "env 
>> >> >> > register"
>> >> >> > to change over time, though it will not spare one more register for 
>> >> >> > the
>> >> >> > TCG code, and it will emit longer TCG code to regularly reload the 
>> >> >> > env
>> >> >> > global into a host register.
>> >> >>
>> >> >> But there will be one more register available in some cases. In other
>> >> >
>> >> > Inside the TCG code, it will basically happens very rarely, given
>> >> > load/store are really the most used instructions, and they need to load
>> >> > the env register.
>> >>
>> >> Not exactly, from a sample run with -d op_opt:
>> >> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC'
>> >> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn
>> >> 1673966 movi_i32
>> >>  653931 ld_i32
>> >
>> > On which target is that? It looks like that some variables could be
>> > mapped directly as a global to avoid so many load/stores.
>>
>> Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X.
>>
>> > Anyway, even this simple load instruction, needs a base register, and it
>> > mosts cases this register is TCG_AREG0. I agree that it can be replaced by
>> > another base register, but in anycase, a register has to be used for that.
>> >
>> > It means after such an instruction is encountered, one register is
>> > reserved for this global up to the next helper/memory load store, so
>> > there isn't any register spared, but more instructions generated to
>> > explicitly load this register.
>> >
>> > I also just realized all globals registered through tcg_global_mem_new()
>> > needs a base register and currently it is TCG_AREG0, that means even a
>> > mov_i32 might need access to TCG_AREG0. How do you plan to solve that
>> > part?
>>
>> In most cases the backing storage is CPUState, so cpu_env would be a
>> logical choice.
>>
>> >>  607432 mov_i32
>> >>  428684 st_i32
>> >>  326878 movi_i64
>> >>  308626 add_i32
>> >>  283186 call
>> >>  256817 exit_tb
>> >>  207232 nopn
>> >>  189388 goto_tb
>> >>  122398 and_i32
>> >>  117997 shr_i32
>> >>   89107 qemu_ld32
>> >>   82926 set_label
>> >>   82713 brcond_i32
>> >>   67169 qemu_st32
>> >>   55109 or_i32
>> >>   46536 ext32u_i64
>> >>   44288 xor_i32
>> >>   38103 sub_i32
>> >>   26361 shl_i32
>> >>   23218 shl_i64
>> >>   23218 qemu_st64
>> >>   23218 or_i64
>> >>   20474 shr_i64
>> >>   20445 qemu_ld64
>> >>   11161 qemu_ld8u
>> >>   10409 qemu_st8
>> >>    5013 qemu_ld16u
>> >>    3795 qemu_st16
>> >>    2776 qemu_ld8s
>> >>    1915 sar_i32
>> >>    1414 qemu_ld16s
>> >>     839 not_i32
>> >>     579 setcond_i32
>> >>     213 br
>> >>      42 ext32s_i64
>> >>      30 mul_i64
>> >>
>> >> But most other ops probably don't need any additional registers. It
>> >> could still be that with the extra register, some values could be kept
>> >> there instead of flushing to storage.
>> >>
>> >> >> cases, the number of registers used does not change. Moving the
>> >> >> registers around is what worries me too.
>> >> >>
>> >> >> But there are other effects too, the helpers are now compiled so that
>> >> >> the global env register is not used. Especially on hosts with low
>> >> >> number of registers this is not optimal.
>> >> >
>> >> > Most helpers are very small functions, so I am not sure more registers
>> >> > will help.
>> >>
>> >> $ nm --print-size --defined-only --size-sort --reverse-sort
>> >> obj-amd64/sparc-softmmu/op_helper.o |head -20
>> >> 0000000000003a70 0000000000000c75 T helper_st_asi
>> >> 00000000000046f0 000000000000086d T helper_ld_asi
>> >> 0000000000002550 000000000000041b T __stw_mmu
>> >> 00000000000012c0 000000000000040c T __stq_mmu
>> >> 0000000000001d10 00000000000003c5 T __stl_mmu
>> >> 0000000000001a90 0000000000000277 T __ldq_mmu
>> >> 00000000000022f0 000000000000025f T __ldl_mmu
>> >> 0000000000002b50 000000000000024f T __ldw_mmu
>> >> 0000000000001870 0000000000000219 t slow_ldq_mmu
>> >> 00000000000020e0 0000000000000208 t slow_ldl_mmu
>> >> 00000000000033f0 0000000000000202 T helper_stqf
>> >> 0000000000002970 00000000000001de t slow_ldw_mmu
>> >> 0000000000005120 00000000000001dd T helper_fcmpeq
>> >> 0000000000003720 00000000000001c7 T helper_ldqf
>> >> 0000000000001120 000000000000019e t slow_stb_mmu
>> >> 00000000000016d0 000000000000019c T __stb_mmu
>> >> 0000000000002da0 000000000000018a t slow_ldb_mmu
>> >> 0000000000005410 0000000000000172 T helper_fcmped
>> >> 0000000000002f30 000000000000016d T __ldb_mmu
>> >> 0000000000000a70 0000000000000164 T do_unassigned_access
>> >>
>> >> These are not so small, but they also don't look like frequently used
>> >> ones, judging by the names. It would be more interesting to see the
>> >> sizes of heavily used helpers.
>> >
>> > Agreed. Also most of these helpers need access to env anyway, so it
>> > means one more argument that might be kept in a register in some parts
>> > of the helper. It might be a good idea to recompile this file without
>> > reserving a register for the env pointer, and passing it as an argument
>> > instead just to see the difference.
>>
>> Which boils down to the full conversion. Maybe that is the only way to
>> get meaningful performance statistics.
>>
>> >> >> > In any case at then end benchmarks is what are need to decided, TCG 
>> >> >> > has
>> >> >> > always shown that performance improvements doesn't match the 
>> >> >> > improvement
>> >> >> > analysis.
>> >> >>
>> >> >> If this turns out to be a bad idea, it means that the reverse
>> >> >> conversion will be beneficial and we should convert helper.c code to
>> >> >> op_helper.c and take advantage of the global env in a register. This
>> >> >> was actually my standpoint when the discussion started, but I'm
>> >> >> interested to see if this approach would work better.
>> >> >
>> >> > Does it mean that you plan to do the code changes in git without really
>> >> > benchmarking, and revert all the changes later if it was a bad idea?
>> >>
>> >> I don't think that that would be a good idea. There are actually
>> >> several possible plans that should be considered, based on the
>> >> performance:
>> >> - don't do anything
>> >> - full conversion
>> >> - avoid AREG0 use outside of generated code: move helpers from
>> >> op_helper.c to helper.c
>> >> - avoid AREG0 in cpu-exec.c etc.
>> >> - maximize AREG0 use: move helpers from helper.c etc. to op_helper.c
>> >>
>> >> It's also possible to refactor qemu_ld/st independently to the above
>> >> so that they still use AREG0.
>> >>
>> >> The performance neutral changes (1, 2, 4 to 7) may have merits of
>> >> their own, so they may be applied later if there are no objections.
>> >>
>> >
>> > It seems the choices there don't match the different possible plans
>> > above (5 vs 7). That said I am not really sure that we agree on what
>> > is performance neutral or not.
>>
>> 4 to 7 use regular stack in place of temp_buf array in CPUState. I'd
>
> I still don't get where are this list of possible changes? Did I miss
> something in another thread?

I'm referring to the patches I sent.

>> suppose this does not change anything performance wise, both the top
>> of stack and CPUState are frequently accessed. The generated
>> instructions are the same except for change of base register. Even if
>> we'd go for maximizing AREG0 use, I think this makes sense since we
>> can't use stack pointer for anything else.
>
> I agree that saving temp_local in the stack instead of register
> shouldn't have any impact.
>
>> With some effort, it may be possible to calculate the amount of TCG
>> temps needed and allocate just the right amount of stack. This could
>> be slightly beneficial for performance, now the size of temp_buf array
>> is fixed so it may have negative cache effects. I doubt this would be
>> worthwhile.
>
> Yes, it is something doable, and I agree it probably won't have much
> impact on performance.
>
>> > I have no objection to remove uses of TCG_AREG0 in GCC generated code,
>> > ie in the helpers, though I doubt it will make a real change on the
>> > performances: slightly bigger intermediate code, one more register move
>> > to call the helpers, on more free register for the helpers. That should
>> > not make a huge difference, and I agree it makes the code cleaner.
>>
>> Maybe I'm missing something, butI don't think this would be very
>> beneficial if AREG0 is not eliminated. The cleanup is a bonus,
>> likewise for improved portability.
>
> My point is having AREG0 from GCC generated code, so we don't have to ask
> GCC to explicitly not use the AREG0 register. It basically means
> TCG_ARGE0 disappearing from everywhere, except tcg/* .
>
> On the TCG generated code, the env structure is used almost for every
> op, so it really makes sense to keep it in a register instead of having to
> reload the address of env regularly from memory. Given it only affects
> TCG generated code, I don't see the point of portability here.

For example, maybe the bugs in Sparc glibc could be avoided by using
one of %i set of registers (not accessible from helpers) for AREG0
within generated code instead of %g registers which seem to be
fragile.

>> > On the other hand, I have objections to remove uses of TCG_AREG0 from
>> > the TCG part. This register is part of the TCG design and thus used very
>> > often. In any case we have to keep it for accesses to the globals, so we
>> > can keep it for softmmu load/stores too.
>>
>> Right, any changes would need to be backed by performance figures.
>>
>> > If we go for the removal of
>> > TCG_AREG0 from the GCC side, it probably means loading it in the prologue
>> > instead.
>>
>> Do you mean cpu-exec.c side with this? I think this is a very similar
>> register tradeoff case to helpers.
>
> I mean calling tcg_qemu_tb_exec(tb_ptr, env) instead of
> tcg_qemu_tb_exec(tb_ptr), and modification in the prologue to save the
> env pointer into the TCG_AREG0 register (which is kept internal to TCG
> generated code).

Maybe it would be easy to test and benchmark this change, it doesn't
affect generated code or helpers.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]