numba: Running only CUDA tests results in strange crashes / failures
Reporting a bug
- I am using the latest released version of Numba (most recent is visible in the change log: https://github.com/numba/numba/blob/master/CHANGE_LOG).
- I have included below a minimal working reproducer (if you are unsure how to write one see http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).
Running only the CUDA test suite seems to produce some surprising results. For example, on my system with a V100, running:
```
python -m numba.runtests numba.cuda.tests -v
```
eventually dies with:
```
test_broadcast (numba.cuda.tests.cudadrv.test_cuda_array_slicing.CudaArraySetting) ...
Fatal Python error: Segmentation fault

Current thread 0x00007fb059d49700 (most recent call first):
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/cudadrv/driver.py", line 293 in safe_cuda_api_call
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/cudadrv/driver.py", line 1622 in launch_kernel
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/cudadrv/driver.py", line 1578 in __call__
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/compiler.py", line 614 in _kernel_call
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/compiler.py", line 540 in __call__
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/compiler.py", line 278 in __call__
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/cudadrv/devicearray.py", line 581 in _do_setitem
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/cudadrv/devicearray.py", line 530 in __setitem__
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/cudadrv/devices.py", line 225 in _require_cuda_context
  File "/home/nfs/gmarkall/numbadev/numba/numba/cuda/tests/cudadrv/test_cuda_array_slicing.py", line 209 in test_broadcast
  File "/home/nfs/gmarkall/miniconda3/envs/numbaenv/lib/python3.7/unittest/case.py", line 628 in run
  File "/home/nfs/gmarkall/miniconda3/envs/numbaenv/lib/python3.7/unittest/case.py", line 676 in __call__
  File "/home/nfs/gmarkall/miniconda3/envs/numbaenv/lib/python3.7/unittest/suite.py", line 122 in run
  File "/home/nfs/gmarkall/miniconda3/envs/numbaenv/lib/python3.7/unittest/suite.py", line 84 in __call__
  File "/home/nfs/gmarkall/miniconda3/envs/numbaenv/lib/python3.7/unittest/runner.py", line 176 in run
  File "/home/nfs/gmarkall/numbadev/numba/numba/testing/main.py", line 123 in run
  File "/home/nfs/gmarkall/miniconda3/envs/numbaenv/lib/python3.7/unittest/main.py", line 271 in runTests
  File "/home/nfs/gmarkall/numbadev/numba/numba/testing/main.py", line 354 in run_tests_real
  File "/home/nfs/gmarkall/numbadev/numba/numba/testing/main.py", line 369 in runTests
  File "/home/nfs/gmarkall/miniconda3/envs/numbaenv/lib/python3.7/unittest/main.py", line 101 in __init__
  File "/home/nfs/gmarkall/numbadev/numba/numba/testing/main.py", line 163 in __init__
  File "/home/nfs/gmarkall/numbadev/numba/numba/testing/__init__.py", line 75 in run_tests
  File "/home/nfs/gmarkall/numbadev/numba/numba/_runtests.py", line 28 in _main
  File "/home/nfs/gmarkall/numbadev/numba/numba/runtests.py", line 9 in <module>
  File "/home/nfs/gmarkall/miniconda3/envs/numbaenv/lib/python3.7/runpy.py", line 85 in _run_code
  File "/home/nfs/gmarkall/miniconda3/envs/numbaenv/lib/python3.7/runpy.py", line 193 in _run_module_as_main
Segmentation fault
```
Running the same command on @stuartarchibald's machine with a GTX 750 Ti doesn't segfault, but produces the following failure:
```
numba/cuda/tests/cudadrv/test_linker.py", line 93, in test_set_registers_57
    self.assertEquals(57, compiled._func.get().attrs.regs)
AssertionError: 57 != 56
```
However, the test suite as a whole has no failures when I run it like:

```
python -m numba.runtests -m
```
The test that segfaults for me is getting discovered in this case - it shows up when listing tests with `-l` - so I don't think it's being omitted when running the whole test suite.
Some initial investigation suggests that some CUDA API calls fail without the failure being detected, leading to launch failures later. In one case, computing the threads per block somehow resulted in a thread count of 0, which a kernel was then configured with and subsequently failed to launch. However, I cannot reliably reproduce this - I'm now getting failures with a nonzero threads-per-block count.
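As a sketch of the "undetected failure" hypothesis (the names and error code below are illustrative stand-ins, not Numba's actual driver internals): if a wrapper around a driver call checks the return code, a failed occupancy/configuration query raises immediately instead of silently yielding a bogus value (such as a thread count of 0) that only blows up at launch time.

```python
CUDA_SUCCESS = 0  # the driver API convention: 0 means success

def checked_api_call(fn, *args):
    # Raise as soon as a driver call fails, rather than letting the
    # error code propagate unnoticed into later launch configuration.
    retcode = fn(*args)
    if retcode != CUDA_SUCCESS:
        raise RuntimeError(
            f"CUDA call {fn.__name__} failed with error code {retcode}")

def occupancy_query_stub():
    # Hypothetical stand-in for a driver call that fails. If this result
    # went unchecked, a downstream computation could end up configuring a
    # kernel with threads-per-block == 0 and failing only at launch.
    return 1  # nonzero: failure

try:
    checked_api_call(occupancy_query_stub)
except RuntimeError as exc:
    print(exc)
```

With a check like this, the failure surfaces at the offending call rather than as a confusing launch error several frames later.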
I plan to continue investigating this further. Some other thoughts / hypotheses:
- Perhaps the `CUDATestCase` class, which is used for a few of the tests and resets the CUDA context on teardown, may be involved: either it resets the context at an inappropriate time, or there's a test class that needs to be a `CUDATestCase` but is presently only a `unittest.TestCase` + `SerialMixin`. I'm not sure of the exact rationale behind / rules for using `CUDATestCase` as opposed to just the `SerialMixin` for CUDA tests; can anyone shed any light on this?
- Some memory corruption may be involved. I'm presently looking for a way to run under valgrind and still hit the error in a way that isn't too painful to do.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 20 (20 by maintainers)
Commits related to this issue
- Fix OOB write in test_round{f4,f8}: the typing of the parameter (`int64[:]`) vs. the type of the array that was passed in (np.int32) resulted in 8 bytes being written to a 4-byte allocation. This was ... — committed to gmarkall/numba by gmarkall 4 years ago
- Merge pull request #5400 from gmarkall/grm-issue-4954 Fix #4954, and some other small CUDA testsuite fixes — committed to numba/numba by sklam 4 years ago
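The out-of-bounds write described in the commit above can be illustrated in isolation (a stdlib sketch of the size mismatch only, not Numba code): an 8-byte int64 store aimed at a slot sized for a 4-byte int32 spills into the neighbouring element.

```python
import struct

# Two adjacent 4-byte int32 slots, both initially zero.
buf = bytearray(8)

# One 8-byte (int64) write aimed at slot 0 -- analogous to a kernel typed
# int64[:] storing into an array that was actually allocated as int32.
struct.pack_into("<q", buf, 0, -1)

# Both int32 slots were overwritten: the store clobbered its neighbour.
first, second = struct.unpack_from("<ii", buf, 0)
print(first, second)
```

In the real test the overrun went past the end of the allocation entirely, which is consistent with the intermittent corruption-like failures seen above.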