qemu-arm

Re: regression in TCG emulation of VTBL neon instruction


From: Ard Biesheuvel
Subject: Re: regression in TCG emulation of VTBL neon instruction
Date: Thu, 5 Nov 2020 00:18:50 +0100

On Wed, 4 Nov 2020 at 21:37, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Ard Biesheuvel <ardb@kernel.org> writes:
>
> > On Wed, 4 Nov 2020 at 18:50, Peter Maydell <peter.maydell@linaro.org> wrote:
> >>
> >> On Wed, 4 Nov 2020 at 17:44, Alex Bennée <alex.bennee@linaro.org> wrote:
> >> > Just checking - what host are you on?
> >>
> >
> > model name : Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> > pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> > pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
> > xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
> > ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
> > sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
> > rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti
> > ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad
> > fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx
> > rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves
> > dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear
> > flush_l1d
>
> Eyeballing hackbox2 which has:
>
> model name      : Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
> xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl 
> vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid
> dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c 
> rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single 
> pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid 
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a 
> avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw 
> avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total 
> cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d
>
> Seems to have avx512 but the avx1 and avx2 stuff is common which will
> make use of more registers in the generated code:
>
>     if (have_avx1) {
>         tcg_target_available_regs[TCG_TYPE_V64] = ALL_VECTOR_REGS;
>         tcg_target_available_regs[TCG_TYPE_V128] = ALL_VECTOR_REGS;
>     }
>     if (have_avx2) {
>         tcg_target_available_regs[TCG_TYPE_V256] = ALL_VECTOR_REGS;
>     }
>
> >
> >
> >> Oh, good question -- what the TCG backend emits as vector
> >> operations or not will depend on the host CPU (eg whether
> >> it supports AVX1/AVX2/etc).
> >>
> >> If the test case can be cut down to a Linux userspace
> >> program that can be run under the qemu-arm single-binary
> >> emulator that will probably also be easier to debug than
> >> "boot whole guest kernel and wait for it to get to a selftest".
> >>
> >
> > Sure. The code can be found at [0]
> >
> > The sequence in question is
> >
> > # r4 between -31 and 0
> > # q4-q5 holding 32 bytes of cipher stream
> >
> > adr lr, .Lpermute + 32
> > add lr, lr, r4
> > vld1.8 {q2-q3}, [lr]
> >
> > vtbl.8 d4, {q4-q5}, d4
> > vtbl.8 d5, {q4-q5}, d5
> > vtbl.8 d6, {q4-q5}, d6
> > vtbl.8 d7, {q4-q5}, d7
> >
> > .Lpermute:
> >  .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
> >  .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
> >  .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
> >  .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
> >  .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
> >  .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
> >  .byte 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
> >  .byte 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
> >
> > This is essentially a bytewise rotate function operating on a 32 byte
> > vector (the patch explains the purpose)
> >
> > Using GDB to single step through the code, I noticed that d6 and d7
> > turn up as all zeroes.
> >
> >
> > [0] https://lore.kernel.org/linux-arm-kernel/20201103162809.28167-1-ardb@kernel.org/
>
>
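
For reference, the quoted sequence can be modelled in plain C. This is
only a sketch of the intended behaviour, not code from the patch or from
QEMU: the function names (fill_permute, vtbl8, rotate32) are made up for
illustration, and the only VTBL property it relies on is that an
out-of-range index byte produces zero.

```c
#include <stdint.h>
#include <string.h>

/* The 64-byte .Lpermute table from the quoted assembly: 0x00..0x1f, twice. */
static void fill_permute(uint8_t permute[64])
{
    for (int i = 0; i < 64; i++) {
        permute[i] = (uint8_t)(i & 0x1f);
    }
}

/*
 * Model of one "vtbl.8 dD, {q4-q5}, dM": each index byte selects a byte
 * from the 32-byte table; an out-of-range index yields zero.
 */
static void vtbl8(uint8_t out[8], const uint8_t table[32], const uint8_t idx[8])
{
    uint8_t tmp[8];  /* temporary, because out aliases idx in the asm */

    for (int i = 0; i < 8; i++) {
        tmp[i] = idx[i] < 32 ? table[idx[i]] : 0;
    }
    memcpy(out, tmp, 8);
}

/* Model of the whole sequence. r4 must be in the range [-31, 0]. */
static void rotate32(uint8_t out[32], const uint8_t stream[32], int r4)
{
    uint8_t permute[64];
    uint8_t idx[32];

    fill_permute(permute);
    memcpy(idx, permute + 32 + r4, 32);   /* vld1.8 {q2-q3}, [lr] */
    for (int d = 0; d < 4; d++) {         /* d4, d5, d6, d7 in turn */
        vtbl8(idx + 8 * d, stream, idx + 8 * d);
    }
    memcpy(out, idx, 32);
}
```

With r4 == 0 the output equals the input; with r4 == -3, output byte 0
is input byte 29. Every index loaded from .Lpermute is below 32, so no
output byte should be forced to zero by the lookup itself.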

So comparing qemu-system-aarch64 and qemu-system-arm running in GDB gives me:

qemu-system-arm:

(gdb) b helper_neon_tbl if maxindex==32
Breakpoint 1 at 0x60e250: file ../target/arm/op_helper.c, line 73.
(gdb) r -M virt -cpu cortex-a15 -m 2048 -net none -nographic -kernel arch/arm/boot/zImage
Starting program: /home/ardbie01/build/qemu/build/qemu-system-arm -M virt -cpu cortex-a15 -m 2048 -net none -nographic -kernel arch/arm/boot/zImage

(gdb) x/8x table
0x555556e6d390: 0xbb75b15a 0xdb0107ff 0x560fe329 0x980e8754
0x555556e6d3a0: 0x08e58eb7 0x814e8602 0x2654e32c 0x979ff7d2

whereas qemu-system-aarch64 gives me

(gdb) x/8x table
0x555556ff8c20: 0xbb75b15a 0xdb0107ff 0x560fe329 0x980e8754
0x555556ff8c30: 0x00000000 0x00000000 0x00000000 0x00000000

Looking at HELPER(neon_tbl)(), it seems to me that casting void *vn to
uint64_t * and indexing it as a flat array fails to account for the SVE
view of the registers: the second 16 bytes of the table (q5) read back
as zeroes in the dump above. This would also explain why
qemu-system-arm works and qemu-system-aarch64 doesn't.
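
To illustrate the suspected failure mode, here is a small standalone
sketch. It is not QEMU code: the VecReg type, the SLOT_U64 value and both
function names are assumptions chosen for the example. It only assumes
that each vector register lives in a slot wider than 128 bits, so that
the low 16 bytes of consecutive registers are not adjacent in memory.

```c
#include <stdint.h>
#include <string.h>

/*
 * Assumed layout, for illustration only: each vector register occupies a
 * slot of SLOT_U64 64-bit words, of which only the first two hold the
 * 128-bit Q register. SLOT_U64 == 2 would model registers packed back to
 * back; a larger value models a wider (SVE-style) slot.
 */
#define SLOT_U64 32   /* hypothetical: 256-byte slot per register */

typedef struct {
    uint64_t d[SLOT_U64];
} VecReg;

/* Read table byte 'index' by treating the first register as a flat array. */
static uint8_t table_byte_flat(const VecReg *first, unsigned index)
{
    const uint64_t *table = (const uint64_t *)first;

    return (uint8_t)(table[index >> 3] >> ((index & 7) * 8));
}

/* Read table byte 'index' by stepping from one register slot to the next. */
static uint8_t table_byte_per_reg(const VecReg *first, unsigned index)
{
    const VecReg *reg = first + (index >> 4);   /* 16 bytes per Q register */
    unsigned off = index & 15;

    return (uint8_t)(reg->d[off >> 3] >> ((off & 7) * 8));
}
```

With two slots holding the bytes 0x00..0x0f and 0x10..0x1f, the flat
variant returns the right value for indices 0 to 15 but reads the unused
(zero) part of the first slot for indices 16 to 31, which matches the
pattern in the second dump. The per-register variant returns the
expected byte for all 32 indices.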


