The first thing that caught my attention is vext_ldst_us() from
target/riscv/vector_helper.c:

    /* load bytes from guest memory */
    for (i = env->vstart; i < evl; i++, env->vstart++) {
        k = 0;
        while (k < nf) {
            target_ulong addr = base + ((i * nf + k) << log2_esz);
            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems,
                      vd, ra);
            k++;
        }
    }
    env->vstart = 0;

Given that this is a unit-stride load/store that accesses contiguous
elements in memory, it seems that this loop could be optimized or
removed, since it loads/stores one element at a time. I didn't find any
TCG op to do that, though. I assume that ARM SVE might have something
of the sort. Richard, care to comment?
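
To make the idea concrete, here is a rough host-side sketch of what I
have in mind. It is not QEMU code: copy_elementwise() and
copy_bulk_nf1() are hypothetical names, plain host buffers stand in for
guest memory and the vector register, and it assumes a little-endian
host with no page crossings, faults or address adjustment to worry
about. For nf == 1 the address is just base + (i << log2_esz), so the
whole range collapses into a single copy. For nf > 1 the fields are
interleaved in memory, so a single copy would not be enough.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of the existing loop: one element of
 * (1 << log2_esz) bytes is moved per inner iteration, using the same
 * index arithmetic as the snippet above.
 */
static void copy_elementwise(uint8_t *vd, const uint8_t *mem,
                             uint32_t vstart, uint32_t evl, uint32_t nf,
                             uint32_t max_elems, uint32_t log2_esz)
{
    size_t esz = (size_t)1 << log2_esz;

    for (uint32_t i = vstart; i < evl; i++) {
        for (uint32_t k = 0; k < nf; k++) {
            /* source offset: (i * nf + k) << log2_esz */
            size_t src = ((size_t)i * nf + k) << log2_esz;
            /* destination element index: i + k * max_elems */
            size_t dst = ((size_t)i + (size_t)k * max_elems) << log2_esz;
            memcpy(vd + dst, mem + src, esz);
        }
    }
}

/*
 * Hypothetical bulk variant, valid only for nf == 1: elements
 * vstart..evl-1 are contiguous in both memory and the register, so
 * one memcpy covers them all.
 */
static void copy_bulk_nf1(uint8_t *vd, const uint8_t *mem,
                          uint32_t vstart, uint32_t evl,
                          uint32_t log2_esz)
{
    if (vstart >= evl) {
        return;
    }
    size_t off = (size_t)vstart << log2_esz;
    size_t len = (size_t)(evl - vstart) << log2_esz;
    memcpy(vd + off, mem + off, len);
}
```

A real helper would still need to fall back to the per-element path
whenever the access can fault partway through, since env->vstart has to
reflect the element that trapped.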