On Thu, 01 Feb 2024 09:39:22 PST (-0800), alex.bennee@linaro.org wrote:
Palmer Dabbelt <palmer@dabbelt.com> writes:
On Tue, 30 Jan 2024 12:28:27 PST (-0800), stefanha@gmail.com wrote:
On Tue, 30 Jan 2024 at 14:40, Palmer Dabbelt <palmer@dabbelt.com> wrote:
On Mon, 15 Jan 2024 08:32:59 PST (-0800), stefanha@gmail.com wrote:
> Dear QEMU and KVM communities,
> QEMU will apply for the Google Summer of Code and Outreachy internship
> programs again this year. Regular contributors can submit project
> ideas that they'd like to mentor by replying to this email before
> January 30th.
It's the 30th; sorry if this is late, but I just saw it today. +Alistair
and Daniel, as I didn't sync up with anyone about this, so I'm not sure if
someone else is already looking (we're not internally).
<snip>
Hi Palmer,
Performance optimization can be challenging for newcomers. I wouldn't
recommend it for a GSoC project unless you have time to seed the
project idea with specific optimizations to implement based on your
experience and profiling. That way the intern has a solid starting
point where they can have a few successes before venturing out to do
their own performance analysis.
Ya, I agree. That's part of the reason why I wasn't sure if it's a
good idea. At least for this one I think there should be some
easy-to-understand performance issues, as the loops that run very slowly
consist of only a small number of instructions.
I'm actually more worried about this running into a rabbit hole of
adding new TCG operations, or even just having no well-defined mappings
between RVV and AVX; either of those might make the project really hard.
You shouldn't have a hard guest-target mapping. But are you already
using the TCGVec types, and they are not expanding to AVX when it's
available?
Ya, sorry, I guess that was an odd way to describe it. IIUC we're
doing sane stuff; it's just that RISC-V has a very different vector
masking model from other ISAs. I just said AVX there because I only
care about the performance on Intel servers, since that's what I run
QEMU on. I'd assume we have similar performance problems on other
targets; I just haven't looked.
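To make the masking difference concrete, here is a minimal scalar sketch of what a masked RVV vadd.vv means at SEW=32, assuming the mask-undisturbed and tail-undisturbed policies: v0 holds one mask bit per element, inactive elements and tail elements (i >= vl) keep their old vd values. The function names are hypothetical and this is not QEMU's helper code. The second function models the "compute everything, then blend" formulation a host would use; note the extra step of widening each packed mask bit into a full-width lane mask, which a byte-blend such as AVX2's vpblendvb requires because AVX2 has no per-element predicate registers (AVX-512 does, via its k mask registers).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model: masked vadd.vv, SEW=32, mask- and
 * tail-undisturbed. v0 is a packed mask: element i uses bit (i % 8)
 * of byte (i / 8). */
static void rvv_vadd_vv_w_mu_tu(uint32_t *vd, const uint32_t *vs2,
                                const uint32_t *vs1, const uint8_t *v0,
                                size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        int active = (v0[i / 8] >> (i % 8)) & 1;
        if (active) {
            vd[i] = vs2[i] + vs1[i];
        }
        /* Inactive elements are left undisturbed. */
    }
    /* Tail elements (i >= vl) are left undisturbed: never written. */
}

/* Hypothetical blend-style model: compute every active-range lane, then
 * select between the new sum and the old vd value. The packed mask bit
 * is first widened to an all-ones or all-zeros 32-bit lane mask. */
static void blend_style_vadd(uint32_t *vd, const uint32_t *vs2,
                             const uint32_t *vs1, const uint8_t *v0,
                             size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        uint32_t bit = (uint32_t)((v0[i / 8] >> (i % 8)) & 1);
        uint32_t lane_mask = 0u - bit;       /* 1 -> 0xFFFFFFFF, 0 -> 0 */
        uint32_t sum = vs2[i] + vs1[i];
        vd[i] = (sum & lane_mask) | (vd[i] & ~lane_mask);
    }
}
```

Both produce identical results; the point is only that the RISC-V mask layout does not line up with a host's per-lane blend mask, so some translation step is always needed.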
So my worry would be that the RVV things we're doing slowly just don't
have fast implementations via AVX, and thus we run into some
intractable problems. That sort of stuff can be really frustrating
for an intern, as everything's new to them, so it can be hard to know
when something's an optimization dead end.
That said, we're seeing 100x slowdowns in microbenchmarks and 10x
slowdowns in real code, so I think there should be some way to do
better.