Bug: dlltool delaylibs corrupt float/double arguments

From: strager
Subject: Bug: dlltool delaylibs corrupt float/double arguments
Date: Sat, 21 May 2022 16:31:58 -0700

I am calling a function in another x64 DLL with the
following C signature:

    int napi_create_double(void*, double, void*);

The first time I call this function, the 'double' argument
ends up as 1.20305e-307 inside napi_create_double, no matter
what value the caller gives. The 'double' is corrupted.
Calls after the first don't corrupt the 'double'.

The cause is ntdll.dll, eventually called by MinGW's
__delayLoadHelper2, modifying the xmm1 register:

#0  0x00007ffd26ce3006 in ntdll!RtlLookupFunctionEntry () from
#1  0x00007ffd26ce05e8 in ntdll!LdrGetProcedureAddressForCaller ()
from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007ffd26ce00a5 in ntdll!LdrGetProcedureAddressForCaller ()
from C:\WINDOWS\SYSTEM32\ntdll.dll
#3  0x00007ffd245b53dc in KERNELBASE!GetProcAddressForCaller () from
#4  0x00007ffcd7b7ca6f in __delayLoadHelper2 (pidd=0x7ffcd7b8ba70
        ppfnIATEntry=0x7ffcd7ecd134 <__imp_napi_create_double>)
#5  0x00007ffcd7b717c9 in __tailMerge_node_napi_lib ()
       from MYDLL.dll
#6  0x000002ad2fe84c50 in ?? ()

   0x00007ffd26ce2ffb <+1051>:  movups (%rdx),%xmm0
   0x00007ffd26ce2ffe <+1054>:  movups %xmm0,(%rsi)
   0x00007ffd26ce3001 <+1057>:  movsd  0x10(%rdx),%xmm1
=> 0x00007ffd26ce3006 <+1062>:  movsd  %xmm1,0x10(%rsi)
   0x00007ffd26ce300b <+1067>:  mov    (%rsi),%rbp
   0x00007ffd26ce300e <+1070>:  mov    %r11,%rax
   0x00007ffd26ce3011 <+1073>:  lock cmpxchg %r12,0x1384d6(%rip)        # 0x7ffd26e1b4f0
   0x00007ffd26ce301a <+1082>:  jne    0x7ffd26ce3102

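According to the Windows x64 documentation, xmm1 is a volatile
register, so ntdll.dll is allowed to overwrite it. xmm1 also
carries the second argument when that argument is a float or
double, so the 'double' is lost before napi_create_double runs.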
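
To make the register assignment concrete, here is a small C sketch of
how the Windows x64 convention maps the first four parameters of a
non-variadic function to registers. The mapping is by position, not by
count of each kind. The names win64_arg_register and arg_kind are made
up for this illustration and are not part of any toolchain.

```c
#include <stddef.h>

/* Kind of a parameter, as far as register assignment is concerned. */
enum arg_kind { ARG_INTEGER, ARG_FLOATING };

/*
 * Hypothetical helper: returns the register that carries the parameter
 * at zero-based `position` under the Windows x64 calling convention
 * (non-variadic case), or NULL if the parameter is passed on the stack.
 */
static const char *win64_arg_register(int position, enum arg_kind kind)
{
    static const char *const integer_regs[] = { "rcx", "rdx", "r8", "r9" };
    static const char *const floating_regs[] = { "xmm0", "xmm1", "xmm2", "xmm3" };

    if (position < 0 || position >= 4)
        return NULL;
    return kind == ARG_FLOATING ? floating_regs[position]
                                : integer_regs[position];
}
```

For napi_create_double(void*, double, void*), positions 0, 1, and 2
map to rcx, xmm1, and r8, which matches the register seen clobbered in
the disassembly above.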

I think the solution is for dlltool's delaylib trampoline to
save xmm1 on the stack before calling __delayLoadHelper2.
I made a patch which does this, and it fixes the bug for me.

See attached patch. I think my patch has two problems:

1. AVX/vmovupd/ymm might not be usable on the target
   machine, but saving just xmm isn't enough. Should we
   perform a CPUID check?
2. We store unaligned with vmovupd. Storing aligned with
   vmovapd would be better. I haven't looked into how to
   align ymm registers when storing on the stack.
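
Both open questions above can be sketched in C, purely as an
illustration. The real trampoline is assembly emitted by dlltool, so
this is not the patch itself. cpu_has_avx and align_down are
hypothetical names; the first relies on GCC's
__builtin_cpu_supports builtin (an assumption that the code is built
with GCC or a compatible compiler on x86), and the second shows the
arithmetic behind rounding a stack address down to a 32-byte boundary,
the same thing `and $-32, %rsp` does in assembly.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if the compiler's runtime CPU
 * detection reports AVX, else 0. Falls back to 0 on non-x86 targets
 * or compilers without the GCC builtins.
 */
static int cpu_has_avx(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;
#endif
}

/*
 * Hypothetical helper: rounds `address` down to a multiple of
 * `alignment`, which must be a power of two. The stack grows downward,
 * so rounding down never moves past memory that is already in use.
 */
static uintptr_t align_down(uintptr_t address, uintptr_t alignment)
{
    return address & ~(alignment - 1);
}
```

A trampoline that aligns the stack this way has to remember the
original stack pointer (for example in a callee-saved register) so it
can restore it before jumping to the resolved function.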

I'd love to get this bug fixed so others don't spend two
days debugging assembly code!

Matthew "strager" Glazar

