qemu-devel

Re: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion


From: Kirill Batuzov
Subject: Re: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion
Date: Tue, 22 Aug 2017 12:04:23 +0300 (MSK)
User-agent: Alpine 2.20 (DEB 67 2015-01-07)

On Fri, 18 Aug 2017, Richard Henderson wrote:

> On 08/18/2017 04:33 AM, Kirill Batuzov wrote:
> > From my own experiments some time ago,
> > 
> > (1) translating vector instructions to vector instructions in TCG is faster 
> > than
> > 
> > (2) translating vector instructions to series of scalar instructions in TCG,
> > which is faster than*
> > 
> > (3) translating vector instructions to single helper calls, which is faster
> > than*
> > 
> > (4) translating vector instructions to helper calls for each vector element.
> > 
> > (*) (2) and (3) may change their respective places in case of some
> > complicated instructions.
> 
> This was my gut feeling as well.  With the caveat that for the ARM SVE case of
> 2048-bit registers we cannot afford to expand inline due to generated code 
> size.
> 
> > ARM (at least ARM32, I have not checked aarch64 in this regard) uses the
> > last, the slowest scheme. As far as I understand, you want to change
> > it to the third approach. This approach is used in SSE emulation; maybe
> > you can use a similar structure of helpers?
> > 
> > I still hope to finish my own series implementing the first
> > approach. I apologize for the long delay since the last update and hope
> > to send the next version sometime next week. I do not think our series
> > contradict each other: you are trying to optimize the existing
> > general-purpose case while I'm trying to optimize the case where both
> > host and guest support vector instructions. Since I'm experimenting on
> > ARM32, we'll not have many merge conflicts either.
> 
> I posted my own, different, take on vectorization yesterday as well.
> 
>   http://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg03272.html
> 
> The primary difference between my version and your version is that I do not
> allow target/cpu/translate*.c to create vector types.  All of the host vector
> expansion is done within tcg/*.c.

I took a look at your approach. The only problem with it is that the
current implementation does not allow vector variables to be kept in
registers between consecutive guest instructions. But this can be
changed. To do it we need to make copy propagation work with memory
locations as well, and make dead code elimination able to remove excess
stores to memory. While in the general case these can be troublesome, if
we limit the analysis to addresses of the form [env + Const] it becomes
relatively easy. I've done a similar thing in my series to track
interference between memory operations and vector global variables. In
the case of your series this affects only performance, so it does not
need to be part of the initial series and can be added later as a
separate patch. I can take care of this once the initial series is
pulled to master.

Overall I like your approach the most out of the three:
 - it handles different representations of guest vectors with host
   vectors seamlessly (unlike my approach, where I still do not know how
   to get it right),
 - it provides better performance than Alex's (and the same as mine once
   we add a bit of alias analysis),
 - it moves in the direction of representing guest vectors not as
   globals, but as a pair (offset, size) in a special address space
   (this approach was successfully used in Valgrind and it handles
   intersecting registers much better than what we have now; we are
   moving in this direction anyway).
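
A minimal sketch of the (offset, size) idea from the last point, in
Python. The Region type, VFP_BASE and the dreg/qreg helpers are
hypothetical and do not reflect the real CPUARMState layout. The only
architectural fact used is that on AArch32 Advanced SIMD register Qn
aliases D(2n) and D(2n+1).

```python
from typing import NamedTuple

class Region(NamedTuple):
    off: int    # byte offset inside the guest-state address space
    size: int   # length in bytes

    def overlaps(self, other):
        return (self.off < other.off + other.size
                and other.off < self.off + self.size)

    def contains(self, other):
        return (self.off <= other.off
                and other.off + other.size <= self.off + self.size)

VFP_BASE = 0x100  # hypothetical offset of the vector register file

def dreg(n):
    """64-bit D register n as a region (assumed contiguous layout)."""
    return Region(VFP_BASE + 8 * n, 8)

def qreg(n):
    """128-bit Q register n; spans the same bytes as D(2n) and D(2n+1)."""
    return Region(VFP_BASE + 16 * n, 16)

# A write to Q0 interferes with D0 and D1, but not with D2.
q0_hits = [n for n in range(4) if qreg(0).overlaps(dreg(n))]
```

With this representation, interference between registers of different
widths is an interval test on offsets, and no per-register alias table
is needed.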

-- 
Kirill


