numba: M1 LLVM Runtimedyld Invalid page reloc value assertion error
We are seeing an LLVM assertion error occurring randomly in our build farm.
The error message is:
Assertion failed: (isInt<33>(Addend) && "Invalid page reloc value."), function encodeAddend, file /path/to/conda-bld/llvmdev_1643905487494/work/lib/ExecutionEngine/RuntimeDyld/Targets/RuntimeDyldMachOAArch64.h, line 210
The earliest report is from Gitter on July 15, 2022.
The error can be triggered with the script below on bdb2384 (run it in IPython or Jupyter, since it relies on the ! shell escape and _exit_code). The error usually occurs within 10 iterations.
!python setup.py build_ext --inplace
c = 0
_exit_code = 0
tests = """
numba.tests.test_stencils.TestManyStencils.test_basic40
numba.tests.test_stencils.TestManyStencils.test_basic70
numba.tests.test_array_constants.TestConstantArray.test_too_big_to_freeze
numba.tests.test_array_manipulation.TestArrayManipulation.test_fill_diagonal_basic
""".split()
cmdarg = ' '.join(tests)
while _exit_code == 0 and c < 150:
    print(f"c={c}".center(80, '='))
    !NUMBA_OPT=0 python -m unittest -vb $cmdarg $cmdarg
    c += 1
    print(f"exit={_exit_code}")
assert _exit_code == 0
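For running the same loop outside IPython, here is a minimal sketch in plain Python using subprocess. The helper names run_until_failure and main are hypothetical and not part of the original report; the test list, the NUMBA_OPT=0 setting, and the 150-iteration cap are taken from the script above. It assumes Numba has already been built in place.

```python
import os
import subprocess
import sys

# Same tests as in the script above.
TESTS = [
    "numba.tests.test_stencils.TestManyStencils.test_basic40",
    "numba.tests.test_stencils.TestManyStencils.test_basic70",
    "numba.tests.test_array_constants.TestConstantArray.test_too_big_to_freeze",
    "numba.tests.test_array_manipulation.TestArrayManipulation.test_fill_diagonal_basic",
]


def run_until_failure(cmd, max_iters=150, extra_env=None):
    """Run cmd repeatedly until it exits non-zero or max_iters is reached.

    Returns (iterations_run, last_exit_code). Hypothetical helper.
    """
    env = dict(os.environ, **(extra_env or {}))
    exit_code = 0
    count = 0
    while exit_code == 0 and count < max_iters:
        print(f"c={count}".center(80, "="))
        exit_code = subprocess.run(cmd, env=env).returncode
        count += 1
        print(f"exit={exit_code}")
    return count, exit_code


def main():
    # The test list is passed twice, as in the original script.
    cmd = [sys.executable, "-m", "unittest", "-vb", *TESTS, *TESTS]
    _, exit_code = run_until_failure(cmd, extra_env={"NUMBA_OPT": "0"})
    # A process killed by a signal (e.g. abort on an assertion) has a
    # negative returncode on POSIX, so only zero is treated as success.
    return 0 if exit_code == 0 else 1
```

Calling sys.exit(main()) from the Numba source checkout should behave like the IPython version, stopping at the first failing iteration.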
The error occurs in both LLVM 11 and LLVM 14.
The current hypothesis is that LLVM's RuntimeDyld is mishandling far jumps. To relate this to the reproducer above, the situation can be created by:
- first JITing some stencil kernels, which tend to be large, especially when NUMBA_OPT=0
- allocating a large amount of memory as in test_too_big_to_freeze (the compilation and execution bits in the tests can be commented out and it will still trigger the error)
- JITing more array operations as in test_fill_diagonal_basic. The assertion error occurs here.
The guess is that JITed code emitted for the stencil tests is reused here. The large allocation in between helps ensure there is a gap/fragmentation in the memory space, so that the fill_diagonal functions are JITed somewhere far away.
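To make the "far away" part concrete, here is a simplified model of the check behind the assertion. It assumes that the value being asserted is the page-relative displacement between the referenced symbol and the instruction, which is what an AArch64 ADRP instruction encodes (a signed 21-bit page count with 4 KiB pages, so a reach of roughly ±4 GiB). The addresses and helper names are made up for illustration and this is not the RuntimeDyld code itself.

```python
PAGE_SIZE = 4096  # ADRP addresses 4 KiB pages


def is_int(bits: int, value: int) -> bool:
    """Mirror of LLVM's isInt<N>: does value fit in a signed N-bit integer?"""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def page_delta(target: int, pc: int) -> int:
    """Byte distance between the 4 KiB pages containing target and pc."""
    return (target & ~(PAGE_SIZE - 1)) - (pc & ~(PAGE_SIZE - 1))


# Hypothetical addresses: code JITed early, and code JITed later after a
# large allocation pushed the new section 6 GiB away.
early_code = 0x1_0000_0000
late_code = early_code + 6 * 2**30

# A reference within the same region fits in 33 bits.
print(is_int(33, page_delta(early_code + 0x2000, early_code)))  # True

# A reference from the late code back to the early code does not.
print(is_int(33, page_delta(early_code, late_code)))  # False
```

Under this model, any reference whose target lies more than about 4 GiB from the referencing instruction cannot be encoded, which matches the scenario in which the fill_diagonal code ends up far from the stencil code it refers to.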
The Julia devs point to a broken large code model in LLVM's RuntimeDyld for MachO AArch64. See https://github.com/JuliaLang/julia/issues/42295#issuecomment-1008427270 and https://github.com/JuliaLang/julia/pull/43664.
About this issue
- State: closed
- Created 2 years ago
- Comments: 47 (25 by maintainers)
Commits related to this issue
- Skip tests that contributes to M1 RuntimeDyLd Assertion error #8567 — committed to sklam/numba by sklam 2 years ago
@gmarkall, I can’t confirm whether it’ll ever fail, but it no longer fails for the particular script that would fail roughly 50% of the time previously. Ran it ~20 times with different initial conditions.
@jacobjivanov Thanks for sharing this info - fortunately you don’t need to build from source to test the fix now, as it’s part of the llvmlite 0.42 / Numba 0.59 release candidates. You can follow the instructions here to install the Numba and llvmlite release candidates: https://numba.discourse.group/t/ann-numba-0-59-0rc1-and-llvmlite-0-42-0rc1/2329
If you try this, I’d really appreciate it if you could let me know whether it appears to have solved the issue for you.
Bump.
I am consistently seeing this on M1 Pro and M2. It’s a bit involved, but it occurs with ~30% probability in my code.
Are you still looking for a reproducer, @gmarkall?
Might take a while to get there as our developers naturally have a strong Python background. We will start with a minimal nixtla setup, which is where this popped up for us. And from there on we will work our way down.
Another thought I think worth sharing - it should be possible to get to a reproducer that doesn’t depend on Numba at all - if it’s minimised as much as possible, it would just involve calls to llvmlite. (Or even simpler than that, a small C++ source that links to LLVM only, to even take llvmlite out of the loop - but I think the “just llvmlite” case would already be a good starting point)
I can give you a script that reproduces it quite often, if that helps.