Re: Lightning loops indefinitely inside jit_emit on M1 Macs
From: Paulo César Pereira de Andrade
Subject: Re: Lightning loops indefinitely inside jit_emit on M1 Macs
Date: Thu, 25 Mar 2021 14:02:52 -0300
On Sun, Mar 21, 2021 at 06:12, Darren Kulp <darren@kulp.ch> wrote:
>
> Hello,
Hi,
> I reproduced the issues on a publicly available machine.
>
> Since cfarm (https://cfarm.tetaneutral.net) has just today announced
> availability of an M1 Mac, I reproduced both my problems on that machine
> (gcc304.fsffrance.org). Probably a lot of people on this mailing list have
> accounts there already.
>
> curl -O http://ftp.gnu.org/gnu/lightning/lightning-2.1.3.tar.gz
> tar xf lightning-2.1.3.tar.gz
> cd lightning-2.1.3
> ./configure --enable-assertions
> make
> DYLD_LIBRARY_PATH=$PWD/lib/.libs ./doc/.libs/rfib
> # The above invocation exits with an assertion failure about MAP_FAILED
>
> sed -i '' s/-O2// configure
> sed -i '' 's/MAP_ANON,/MAP_JIT | &/' lib/lightning.c
> ./configure --enable-assertions
> make
> DYLD_LIBRARY_PATH=$PWD/lib/.libs ./doc/.libs/rfib
> # The above invocation exits with a bus error
It looks like it might be necessary to use mremap, or to have some
system-wide configuration that allows the sample binaries and the
check/lightning test tool to execute JIT code.
I do not know how the security features work. It might be something
like SELinux. As long as root is not required to enable
JIT execution, I should be able to fix it in the next few days using
the cfarm host.
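
One possibility, untested here: Apple's documentation describes MAP_JIT
pages on Apple silicon as being either writable or executable for a given
thread, switched with pthread_jit_write_protect_np(). Writing while the
pages are in the executable state could explain the bus error above, but
nothing in this thread confirms that. The sketch below only illustrates
that model. The helper names jit_region_alloc and jit_region_write are
hypothetical and are not part of lightning, and the MAP_JIT fallback to 0
is an assumption made so the same code also builds on other systems.

```c
#define _DEFAULT_SOURCE 1   /* expose MAP_ANON on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#if defined(__APPLE__) && defined(__aarch64__)
#include <pthread.h>        /* pthread_jit_write_protect_np() */
#endif

#ifndef MAP_JIT
#define MAP_JIT 0           /* assumption: no extra flag needed elsewhere */
#endif

/* Hypothetical helper: map a read/write/execute region for JIT code.
 * Returns NULL instead of MAP_FAILED so callers can test it directly. */
static void *jit_region_alloc(size_t length)
{
    assert(length > 0);
    void *ptr = mmap(NULL, length,
                     PROT_EXEC | PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON | MAP_JIT, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
}

/* Hypothetical helper: copy already-encoded instructions into the region. */
static void jit_region_write(void *dst, const void *code, size_t size)
{
#if defined(__APPLE__) && defined(__aarch64__)
    pthread_jit_write_protect_np(0);  /* pages writable for this thread */
#endif
    memcpy(dst, code, size);
#if defined(__APPLE__) && defined(__aarch64__)
    pthread_jit_write_protect_np(1);  /* pages executable again */
    /* A real JIT would also flush the instruction cache before running. */
#endif
}
```

The write toggle is kept inside one helper so that every store into the
region goes through it. Whether lightning can be structured that way is
an open question.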
> I will continue to try when I can to debug the issue, but maybe someone who
> has access to cfarm and who knows lightning will be able to see what I am
> missing.
>
> Darren Kulp
>
> On Mar 20, 2021, at 09:40, Darren Kulp <darren@kulp.ch> wrote:
>
> Hello again,
>
> I think I have learned that my original problem is that MAP_JIT seems to be
> required on M1 Macs (at least on my macOS 11.2.2) when combining PROT_WRITE
> and PROT_EXEC, but there might also be another issue.
>
> I did not originally understand how to build correctly with debugging (since
> `./configure --help` does not seem to show anything related to debugging),
> but after I compiled with `./configure --enable-assertions`, I found that the
> mmap() call was actually failing the first time (with a _jit->code.length of
> 4096) :
>
> kulp@ego lightning-2.1.3 % DYLD_LIBRARY_PATH=$PWD/lib/.libs ./doc/.libs/rfib
> Assertion failed: (_jit->code.ptr != MAP_FAILED), function _jit_emit, file
> lightning.c, line 2027.
> zsh: abort DYLD_LIBRARY_PATH=$PWD/lib/.libs ./doc/.libs/rfib
>
> I found out that macOS has a MAP_JIT flag for mmap() in order to allow
> combining PROT_WRITE and PROT_EXEC :
>
> https://developer.apple.com/documentation/bundleresources/entitlements/com_apple_security_cs_allow-jit
>
> See also comments in this pull request I found :
> https://github.com/herumi/xbyak/pull/84
>
> When I added MAP_JIT flag like this at the affected mmap() call :
>
> _jit->code.ptr = mmap(NULL, _jit->code.length,
> PROT_EXEC | PROT_READ | PROT_WRITE,
> MAP_PRIVATE | MAP_ANON | MAP_JIT, mmap_fd, 0);
>
> then I no longer saw that assertion. Instead I see a bus error later :
>
> * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS
> (code=2, address=0x1001d0000)
> frame #0: 0x0000000100128ec0 liblightning.1.dylib`_emit_code [inlined]
> _oxxx7(_jit=0x00000001002069f0, Op=-1451229184, Rt=29, Rt2=30, Rn=31,
> Simm7=-20) at jit_aarch64-cpu.c:1027:5 [opt]
> 1024 i.Rt2.b = Rt2;
> 1025 i.Rn.b = Rn;
> 1026 i.imm7.b = Simm7;
> -> 1027 ii(i.w);
>
> but the debugger seems to get mismatching DWARF info when optimizations are
> enabled.
>
> (lldb) p i
> error: Couldn't materialize: couldn't get the value of variable i:
> DW_OP_piece for offset 1 but top of stack is of size 9
> error: errored out in DoExecute, couldn't PrepareToExecuteJITExpression
> (lldb) frame variable
> (jit_state_t *) _jit = 0x0000000100304160
> (jit_int32_t) Op = -1451229184
> (jit_int32_t) Rt = 29
> (jit_int32_t) Rt2 = 30
> (jit_int32_t) Rn = 31
> (jit_int32_t) Simm7 = -20
> (instr_t) i = <DW_OP_piece for offset 1 but top of stack is of size 9>
>
>
> I edited the `configure` to remove `-O2` and rebuilt. Now I get the same bus
> error but I get more information, which I attached in “debugger-state.txt”.
>
> <debugger-state.txt>
>
> I did attach some build logs in case they are helpful (these are with -O2
> still enabled).
>
> kulp@ego lightning-2.1.3 % ./configure --enable-assertions &> configure.output
> kulp@ego lightning-2.1.3 % make V=1 &> make.output
>
> <make.output>
> <config.log>
> <configure.output>
>
> When I get some more time I will look into this further, since I am sure it
> is hard for others to debug it with this information.
>
> Darren Kulp
>
> On Mar 13, 2021, at 12:21, Darren Kulp <darren@kulp.ch> wrote:
>
> Thanks for your response. I picked a busy time for me (starting a new job in
> a new city) so it will take me a bit longer to get back to this than I hoped,
> but I expect to get you a fuller response within a few weeks.
>
> Darren Kulp
>
> On Mar 11, 2021, at 14:40, Paulo César Pereira de Andrade
> <paulo.cesar.pereira.de.andrade@gmail.com> wrote:
>
> On Sun, Mar 7, 2021 at 19:17, Darren Kulp <darren@kulp.ch> wrote:
>
>
> Hello,
>
>
> Hi,
>
> Thank you for GNU lightning. It is a great tool and I have appreciated how
> things generally just work. Right now, I am seeing a rare exception to that
> rule: when I build GNU lightning 2.1.3 on my M1 Macbook (arm64 architecture),
> the generated example codes appear to hang inside jit_emit().
>
> The script below shows what I know so far.
>
> curl -O http://ftp.gnu.org/gnu/lightning/lightning-2.1.3.tar.gz
> tar xf lightning-2.1.3.tar.gz
> cd lightning-2.1.3
> CFLAGS=-g3 LDFLAGS=-g3 ./configure && make
> DYLD_LIBRARY_PATH=$PWD/lib/.libs ./doc/.libs/rfib &
> sleep 5 # arbitrary sleep time to allow it to get stuck
> lldb -p $(pgrep rfib)
>
>
> After those commands, an LLDB backtrace shows this :
>
> (lldb) bt
> * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> * frame #0: 0x000000018e6259c8 libsystem_kernel.dylib`__mmap + 8
> frame #1: 0x000000018e625954 libsystem_kernel.dylib`mmap + 52
> frame #2: 0x000000010025b870
> liblightning.1.dylib`_jit_emit(_jit=0x0000000127e06970) at lightning.c:2065:23
> frame #3: 0x0000000100237d2c rfib`main(argc=1, argv=0x000000016fbcb940) at
> rfib.c:47:9
> frame #4: 0x000000018e679f34 libdyld.dylib`start + 4
>
> Stepping through the code, it appears that the code is stuck in the loop
> starting at line 2033 of lightning.c, and that emit_code() continues to
> return NULL each time it is called.
>
>
> This should only happen if, while generating JIT code, it notices that the
> instruction pointer in the mmap'ed area would overflow.
>
> Can you run under a debug environment? It would be very valuable
> to know what value _jit.length has in the first call to emit_code. It
> should have calculated a sane value, but to enter an infinite loop,
> it probably has a negative, and very small, value, as it increments
> the size by 4k at a time and tries again. It really should not even
> loop, as it should never miscalculate that badly.
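
The retry-on-overflow behaviour described above can be sketched roughly as
follows. This is not lightning's actual code. try_emit is a hypothetical
stand-in for emit_code(), and the max_attempts cap is an assumption added
so the sketch terminates, whereas the loop described in this thread keeps
retrying for as long as emit_code() returns NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for emit_code(): succeeds only when the buffer
 * can hold `needed` bytes, and returns 0 (like NULL) on overflow. */
static int try_emit(size_t capacity, size_t needed)
{
    return needed <= capacity;
}

/* Grow the buffer 4 KiB at a time until emission fits.
 * Returns the number of attempts used, or 0 if max_attempts ran out.
 * *final_length receives the last buffer length that was considered. */
static size_t emit_with_retry(size_t initial_length, size_t needed,
                              size_t max_attempts, size_t *final_length)
{
    assert(final_length != NULL);
    size_t length = initial_length;
    for (size_t attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_emit(length, needed)) {
            *final_length = length;
            return attempt;
        }
        length += 4096;  /* same 4k step as described above */
    }
    *final_length = length;
    return 0;
}
```

With a sane initial estimate the first attempt succeeds. A badly wrong
estimate makes the attempt count climb, which is what an apparent hang
would look like from outside.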
>
> To debug this issue, it should be enough to set a breakpoint in
> _jit_emit, then a watchpoint on *(long*)_jit->code.length
>
> You should also check what value _jit->code.end has, as it might also
> be somehow getting an incorrect value, but in all cases it
> should be due to bad code generation. Can you also share all
> build logs? Maybe the compiler is giving a warning about some
> issue that my test environment on Linux with gcc did not show.
>
> I noticed this problem first when I tried to use Homebrew to install GNU
> lightning on my mac. Homebrew has “bottles” (binary distributions) compiled
> for Intel platforms including macOS Big Sur, but not for M1 Macs as of this
> writing.
>
> kulp@ego /tmp % brew install lightning
> Error: lightning: no bottle available!
> You can try to install from source with:
> brew install --build-from-source lightning
>
> When I tried to install using `brew install --build-from-source lightning`, I
> noticed that the `check` process took 100% CPU for a long time (over an
> hour), so I guessed it must be in an infinite loop, and tried to build it
> myself as I had previously done successfully on Intel Macs. That is when I
> discovered the details I show above.
>
> I would have tried to reproduce the issue on master, but I get stuck with
> autoconf (my autoconf 2.69 rejects some directives in configure.ac).
>
> In a few days I will regain access to my Intel Mac (with older macOS version
> of High Sierra instead of Big Sur), for comparison. Until then, can anyone
> suggest something else I could try in order to narrow things down ?
>
> Darren Kulp
Thanks,
Paulo