ucx-py: Failing Numba Test on DGX

On DGX15 I get a failing test when running all the numba tests in succession:

UCX_MEMTYPE_CACHE=n UCX_TLS=rc,tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm py.test -s -v tests/test_send_recv.py::test_send_recv_numba
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.3, pytest-5.2.0, py-1.8.0, pluggy-0.13.0 -- /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/nfs/bzaitlen/GitRepos/ucx-py/.hypothesis/examples')
rootdir: /home/nfs/bzaitlen/GitRepos/ucx-py
plugins: hypothesis-4.38.1, asyncio-0.10.0
collected 21 items

tests/test_send_recv.py::test_send_recv_numba[|u1-1] [1569987737.314062] [dgx15:22850:0]         parser.c:1568 UCX  WARN  unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-16] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-256] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-4096] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-65536] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-1048576] PASSED
tests/test_send_recv.py::test_send_recv_numba[|u1-16777216] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-1] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-16] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-256] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-4096] PASSED
tests/test_send_recv.py::test_send_recv_numba[<i8-65536] [dgx15:22850:0:22850] ib_mlx5_log.c:139  Remote access on mlx5_1:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[dgx15:22850:0:22850] ib_mlx5_log.c:139  RC QP 0x648 wqe[3]: RDMA_WRITE s-- [rva 0x7f290fa80000 rkey 0x20cfc] [va 0x7f290fa00000 len 524288 lkey 0x24dad]
==== backtrace ====
    0  /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucs.so.0(ucs_fatal_error_message+0xdf) [0x7f2933f5cdce]
    1  /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucs.so.0(ucs_log_default_handler+0x159) [0x7f2933f60a2a]
    2  /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucs.so.0(ucs_log_dispatch+0xf8) [0x7f2933f60c62]
    3  /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x667) [0x7f29f26af424]
    4  /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/ucx/libuct_ib.so.0(+0x80d33) [0x7f29f270fd33]
    5  /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x7f) [0x7f29f26b1ab5]
    6  /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x4eb8) [0x7f29f270efc7]
    7  /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucp.so.0(+0x2bbe0) [0x7f2934606be0]
    8  /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/lib/libucp.so.0(ucp_worker_progress+0x137) [0x7f293460d38d]
    9  /home/nfs/bzaitlen/GitRepos/ucx-py/ucp/_libs/core.cpython-37m-x86_64-linux-gnu.so(+0xcc61) [0x7f293489fc61]
   10  /home/nfs/bzaitlen/GitRepos/ucx-py/ucp/_libs/core.cpython-37m-x86_64-linux-gnu.so(+0x11d7a) [0x7f29348a4d7a]

When running the offending test individually the test passes:

(cudf_dev10.1) bzaitlen@dgx15:~/GitRepos/ucx-py$ UCX_MEMTYPE_CACHE=n UCX_TLS=rc,tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm py.test -s -v "tests/test_send_recv.py::test_send_recv_numba[<i8-65536]"
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.7.3, pytest-5.2.0, py-1.8.0, pluggy-0.13.0 -- /home/nfs/bzaitlen/miniconda3/envs/cudf_dev10.1/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/nfs/bzaitlen/GitRepos/ucx-py/.hypothesis/examples')
rootdir: /home/nfs/bzaitlen/GitRepos/ucx-py
plugins: hypothesis-4.38.1, asyncio-0.10.0
collected 1 item

tests/test_send_recv.py::test_send_recv_numba[<i8-65536] [1569987838.946156] [dgx15:23205:0]         parser.c:1568 UCX  WARN  unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
PASSED

============================================================================================================= 1 passed in 1.27s ==============================================================================================================
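Since the test only fails after other parametrizations have run in the same process, one way to narrow this down is to rerun it with progressively more of the preceding tests until the failure reappears. The sketch below is only an illustration of that idea, not something from ucx-py: `run_pytest` and `shortest_failing_suffix` are hypothetical helper names, the environment variables are copied from the commands above, and it assumes pytest runs the ids in the order they are passed on the command line.

```python
import os
import subprocess
import sys

# Environment copied from the commands in this report.
UCX_ENV = {
    "UCX_MEMTYPE_CACHE": "n",
    "UCX_TLS": "rc,tcp,sockcm,cuda_copy,cuda_ipc",
    "UCX_SOCKADDR_TLS_PRIORITY": "sockcm",
}


def run_pytest(test_ids):
    """Run the given test ids in one pytest process.

    Returns True if pytest exits with code 0. A fatal UCX error that
    aborts the interpreter gives a non-zero code, so it counts as a failure.
    """
    env = dict(os.environ, **UCX_ENV)
    cmd = [sys.executable, "-m", "pytest", "-s", "-v", *test_ids]
    return subprocess.run(cmd, env=env).returncode == 0


def shortest_failing_suffix(test_ids, run=run_pytest):
    """Find the shortest run of tests ending at the last id that fails.

    test_ids is ordered as in the full session, with the failing test last.
    Starts with the last test alone and prepends one earlier test at a time.
    Returns the first suffix for which run(suffix) is False, or None if
    every suffix passes. This is the shortest contiguous suffix, not
    necessarily the minimal set of tests that triggers the failure.
    """
    for start in range(len(test_ids) - 1, -1, -1):
        suffix = test_ids[start:]
        if not run(suffix):
            return suffix
    return None
```

To use it, build the list of ids from the session above in order (for example `"tests/test_send_recv.py::test_send_recv_numba[|u1-1]"` through `"tests/test_send_recv.py::test_send_recv_numba[<i8-65536]"`) and call `shortest_failing_suffix(ids)`. Passing the ids as a list to `subprocess.run` avoids the shell quoting that the `[<i8-65536]` id needs on the command line.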

@pentschev do you think this is related to some of the failing RDMA behaviors you’ve seen?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

I can confirm that the tests also pass for @pentschev and me 😃

I just tried devel and all tests passed (in all combinations I mentioned in my previous comment). So presumably the issue is within ucx-py’s master branch?

@pentschev we needed to make branch-0.10 the default branch for CI purposes. The tests are still failing there, but I wanted to state explicitly that all further PRs are being merged into branch-0.10, not master. Apologies for the inconvenience.