numba: M1 LLVM Runtimedyld Invalid page reloc value assertion error

We are seeing an LLVM assertion error occurring randomly in our build farm.

The error message is:

Assertion failed: (isInt<33>(Addend) && "Invalid page reloc value."), function encodeAddend, file /path/to/conda-bld/llvmdev_1643905487494/work/lib/ExecutionEngine/RuntimeDyld/Targets/RuntimeDyldMachOAArch64.h, line 210

The earliest report is from Gitter on July 15, 2022.

The error can be triggered with the script below on commit bdb2384. It usually occurs within 10 iterations.

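# Run in IPython/Jupyter: lines starting with "!" are shell escapes, and
# IPython stores the exit status of the last one in _exit_code.
!python setup.py build_ext --inplace
c = 0
_exit_code = 0
tests = """
numba.tests.test_stencils.TestManyStencils.test_basic40
numba.tests.test_stencils.TestManyStencils.test_basic70
numba.tests.test_array_constants.TestConstantArray.test_too_big_to_freeze
numba.tests.test_array_manipulation.TestArrayManipulation.test_fill_diagonal_basic
""".split()
cmdarg = ' '.join(tests)
# Repeat until the first failing run, giving up after 150 iterations.
while _exit_code == 0 and c < 150:
    print(f"c={c}".center(80, '='))
    # Each run is a fresh process with optimization disabled (NUMBA_OPT=0).
    !NUMBA_OPT=0 python -m unittest -vb $cmdarg $cmdarg
    c += 1
    print(f"exit={_exit_code}")
    assert _exit_code == 0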
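The script above relies on IPython's `!` syntax. For anyone who wants to run the same loop outside a notebook, here is a rough plain-Python sketch using `subprocess`. The helper names `build_command` and `run_until_failure` are made up for illustration and are not part of Numba; the build step (`python setup.py build_ext --inplace`) is assumed to have been run already.

```python
import os
import subprocess
import sys

TESTS = [
    "numba.tests.test_stencils.TestManyStencils.test_basic40",
    "numba.tests.test_stencils.TestManyStencils.test_basic70",
    "numba.tests.test_array_constants.TestConstantArray.test_too_big_to_freeze",
    "numba.tests.test_array_manipulation.TestArrayManipulation.test_fill_diagonal_basic",
]


def build_command(tests=TESTS):
    """Hypothetical helper: build the unittest command line.

    The test list is passed twice, mirroring `$cmdarg $cmdarg` in the
    original script.
    """
    return [sys.executable, "-m", "unittest", "-vb", *tests, *tests]


def run_until_failure(cmd, max_iters=150, extra_env=None):
    """Hypothetical helper: rerun `cmd` until it exits non-zero.

    Returns (iteration_index, exit_code) for the first failing run, or
    (max_iters, 0) if every run succeeded.
    """
    env = dict(os.environ, **(extra_env or {}))
    for i in range(max_iters):
        print(f"c={i}".center(80, "="))
        code = subprocess.run(cmd, env=env).returncode
        print(f"exit={code}")
        if code != 0:
            return i, code
    return max_iters, 0
```

To mirror the original loop, call `run_until_failure(build_command(), extra_env={"NUMBA_OPT": "0"})` from the Numba source tree. On POSIX systems, a child killed by a signal shows up as a negative return code, so a run that dies on the assertion should be reported as non-zero rather than as an ordinary test failure.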

The error occurs in both LLVM 11 and LLVM 14.

The current hypothesis is that LLVM's RuntimeDyld is mishandling far jumps. To relate this to the reproducer above, the situation can be created by:

1. First, JITing some stencil kernels, which tend to be large, especially when NUMBA_OPT=0.
2. Allocating a large amount of memory, as in test_too_big_to_freeze (the compilation and execution bits in the test can be commented out and it will still trigger the error).
3. JITing more array operations, as in test_fill_diagonal_basic. The assertion error occurs here. The guess is that JITed code emitted for the stencil tests is reused here. The large allocation in between helps make sure there is a gap/fragmentation in the memory space, so that the fill_diagonal functions are JITed somewhere far away.
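To make the "far away" part concrete, here is a small sketch of the range check behind the assertion. It assumes the addend being checked is the distance between the 4 KiB page of the target and the 4 KiB page of the referencing instruction, which is what an AArch64 `ADRP` encodes (a signed 21-bit page count, i.e. a signed 33-bit byte offset, roughly ±4 GiB). The function names and the addresses are made up for illustration.

```python
PAGE_SIZE = 4096  # ADRP addresses memory in 4 KiB pages


def is_int(bits, value):
    """Python analogue of LLVM's isInt<bits>(value): does `value` fit in
    a signed integer that is `bits` wide?"""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def page_delta(target, pc):
    """Byte distance between the page holding `target` and the page
    holding `pc` (assumed form of the page relocation addend)."""
    return (target & ~(PAGE_SIZE - 1)) - (pc & ~(PAGE_SIZE - 1))


# Hypothetical addresses: code JITed early, and two later code regions.
early_code = 0x1_0000_0000
near_code = early_code + (3 << 30)  # 3 GiB away: within reach
far_code = early_code + (5 << 30)   # 5 GiB away: out of reach

print(is_int(33, page_delta(early_code, near_code)))  # True
print(is_int(33, page_delta(early_code, far_code)))   # False
```

Under that assumption, the assertion fires whenever newly JITed code needs a page-relative reference to something more than about 4 GiB away, which is the kind of layout the large allocation in step 2 is meant to provoke.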

Julia devs are pointing to a broken large code model in LLVM's RuntimeDyld for MachO aarch64. See https://github.com/JuliaLang/julia/issues/42295#issuecomment-1008427270 and https://github.com/JuliaLang/julia/pull/43664.

About this issue

State: closed
Created 2 years ago
Comments: 47 (25 by maintainers)

Commits related to this issue

Skip tests that contributes to M1 RuntimeDyLd Assertion error #8567 — committed to sklam/numba by sklam 2 years ago

Most upvoted comments

@gmarkall, I can’t confirm whether it’ll ever fail, but it no longer fails for the particular script that would fail roughly 50% of the time previously. Ran it ~20 times with different initial conditions.

jacobjivanov on Dec 21, 2023

@jacobjivanov Thanks for sharing this info - fortunately you don’t need to build from source to test the fix now, as it’s part of the llvmlite 0.42 / Numba 0.59 release candidates. You can follow the instructions here to install the Numba and llvmlite release candidates: https://numba.discourse.group/t/ann-numba-0-59-0rc1-and-llvmlite-0-42-0rc1/2329

If you try this, I’d really appreciate if you can let me know whether it appears to have solved the issue for you.

gmarkall on Dec 21, 2023

Bump.

I am consistently seeing this on M1 Pro and M2. It’s a bit involved, but it occurs with ~30% probability in my code.

Are you still looking for a reproducer @gmarkall ?

PhilipVinc on Oct 19, 2023

Might take a while to get there as our developers naturally have a strong python background. We will start with a minimal nixtla setup, which is where this popped up for us. And from there on we will work our way down.

carstenr on Aug 28, 2023

Alright, that means we've got two large cases to reproduce, then. We will focus on reducing them as much as possible.

Another thought I think worth sharing - it should be possible to get to a reproducer that doesn’t depend on Numba at all - if it’s minimised as much as possible, it would just involve calls to llvmlite. (Or even simpler than that, a small C++ source that links to LLVM only, to even take llvmlite out of the loop - but I think the “just llvmlite” case would already be a good starting point)

gmarkall on Aug 25, 2023

@carstenr I had a little more thought about this recently… One of the problems that makes it hard to think about a fix is that reproducing the issue is a giant pain at present - if you’re able to do anything to take the existing reproducers and simplify them at all, that would help make it easier for someone (or yourself) to understand the issue and work on a fix.

I can give you a script that is able to reproduce it quite often, if that can help.

Francyrad on Aug 24, 2023