ucx-py: Failing Numba Test on DGX
On DGX15 I get a failing test when running all the numba tests in succession:
UCX_MEMTYPE_CACHE=n UCX_TLS=rc,tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm py.test -s -v tests/test_send_recv.py::test_send_recv_numba
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.3, pytest-5.2.0, py-1.8.0, pluggy-0.13.0 -- /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/nfs/bzaitlen/GitRepos/ucx-py/.hypothesis/examples')
rootdir: /home/nfs/bzaitlen/GitRepos/ucx-py
plugins: hypothesis-4.38.1, asyncio-0.10.0
collected 21 items
tests/test_send_recv.py::test_send_recv_numba[|u1-1] [1569987737.314062] [dgx15:22850:0] parser.c:1568 UCX WARN unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-16] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-256] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-4096] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-65536] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-1048576] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-16777216] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-1] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-16] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-256] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-4096] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-65536] [dgx15:22850:0:22850] ib_mlx5_log.c:139 Remote access on mlx5_1:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[dgx15:22850:0:22850] ib_mlx5_log.c:139 RC QP 0x648 wqe[3]: RDMA_WRITE s-- [rva 0x7f290fa80000 rkey 0x20cfc] [va 0x7f290fa00000 len 524288 lkey 0x24dad]
==== backtrace ====
0 /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucs.so.0(ucs_fatal_error_message+0xdf) [0x7f2933f5cdce]
1 /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucs.so.0(ucs_log_default_handler+0x159) [0x7f2933f60a2a]
2 /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucs.so.0(ucs_log_dispatch+0xf8) [0x7f2933f60c62]
3 /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x667) [0x7f29f26af424]
4 /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/ucx/libuct_ib.so.0(+0x80d33) [0x7f29f270fd33]
5 /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x7f) [0x7f29f26b1ab5]
6 /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x4eb8) [0x7f29f270efc7]
7 /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucp.so.0(+0x2bbe0) [0x7f2934606be0]
8 /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucp.so.0(ucp_worker_progress+0x137) [0x7f293460d38d]
9 /home/nfs/bzaitlen/GitRepos/ucx-py/ucp/_libs/core.cpython-37m-x86_64-linux-gnu.so(+0xcc61) [0x7f293489fc61]
10 /home/nfs/bzaitlen/GitRepos/ucx-py/ucp/_libs/core.cpython-37m-x86_64-linux-gnu.so(+0x11d7a) [0x7f29348a4d7a]
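As a quick sanity check on the log above, here is a minimal sketch that relates the failing parametrization to the `RDMA_WRITE` line. It assumes the test id `<i8-65536` means 65536 elements of a little-endian 8-byte integer (the usual NumPy dtype string reading); the item-size table is written out by hand for illustration and is not taken from the test file.

```python
# Item sizes in bytes for the dtype strings seen in the test ids.
# Assumption: these are NumPy-style dtype strings (|u1 = 1 byte, <i8 = 8 bytes).
ITEMSIZE = {"|u1": 1, "<i8": 8}

# Failing case, read off the id test_send_recv_numba[<i8-65536].
dtype, n_elements = "<i8", 65536
message_bytes = ITEMSIZE[dtype] * n_elements

# Values copied verbatim from the RDMA_WRITE line in the log.
logged_len = 524288
rva = 0x7F290FA80000  # remote virtual address
va = 0x7F290FA00000   # local virtual address

print(message_bytes == logged_len)  # True: the failing write is the full message
print(hex(rva - va))                # 0x80000, i.e. 524288 bytes
```

The logged `len` matches the full payload of the failing case. The gap between `rva` and `va` is the same number of bytes; that is only an arithmetic observation and does not by itself explain the remote access error.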
When running the offending test individually the test passes:
(cudf_dev10.1) bzaitlen@dgx15:~/GitRepos/ucx-py$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm py.test -s -v "tests/test_send_recv.py::test_send_recv_numba[<i8-65536]"
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.3, pytest-5.2.0, py-1.8.0, pluggy-0.13.0 -- /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/nfs/bzaitlen/GitRepos/ucx-py/.hypothesis/examples')
rootdir: /home/nfs/bzaitlen/GitRepos/ucx-py
plugins: hypothesis-4.38.1, asyncio-0.10.0
collected 1 item
tests/test_send_recv.py::test_send_recv_numba[<i8-65536] [1569987838.946156] [dgx15:23205:0] parser.c:1568 UCX WARN unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
PASSED
============================================================================================================= 1 passed in 1.27s ==============================================================================================================
@pentschev do you think this is related to some of the failing RDMA behaviors you’ve seen?
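Since the test passes alone but fails after earlier cases have run, one way to narrow this down is to pair each earlier parametrization with the failing one. The sketch below only builds the command lines; it does not run them. The environment variables and test path are copied from the commands above. The dtype list covers only the two dtypes visible in the log (21 items were collected, so there is at least one more that is not shown here), and it assumes pytest runs node ids in the order they are given on the command line.

```python
import shlex

# Environment and test path copied from the commands in this report.
ENV = {
    "UCX_MEMTYPE_CACHE": "n",
    "UCX_TLS": "rc,tcp,sockcm,cuda_copy,cuda_ipc",
    "UCX_SOCKADDR_TLS_PRIORITY": "sockcm",
}
TEST = "tests/test_send_recv.py::test_send_recv_numba"

# Only the dtypes that appear in the log; the full test has more cases.
DTYPES = ["|u1", "<i8"]
SIZES = [16**i for i in range(7)]  # 1, 16, 256, ..., 16777216


def node_ids():
    """Node ids in the order they appear in the log."""
    return [f"{TEST}[{d}-{n}]" for d in DTYPES for n in SIZES]


def command(ids):
    """Build a shell command that runs the given node ids in one session."""
    env = " ".join(f"{k}={v}" for k, v in ENV.items())
    return f"{env} py.test -s -v " + " ".join(shlex.quote(i) for i in ids)


failing = f"{TEST}[<i8-65536]"
ids = node_ids()
earlier_cases = ids[: ids.index(failing)]

# One command per earlier case: run it first, then the failing case.
for earlier in earlier_cases:
    print(command([earlier, failing]))
```

If one of these pairs reproduces the error, the earlier case in that pair is a candidate for leaving state behind. If none does, the failure may need several earlier cases, and longer prefixes of `earlier_cases` can be tried the same way.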
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 18 (18 by maintainers)
I can confirm that the tests also pass for @pentschev and me 😃
@pentschev we needed to make branch-0.10 the default branch for CI purposes. The tests are still failing there, but I wanted to explicitly state that branch-0.10 is where all further PRs are being merged, not master. Apologies for the inconvenience.