ucx-py: [BUG] Error while starting dask-client with ucx on dgx-2
I am trying to start a dask-client with UCX on a DGX-2, but I get the following error:
select.c:438 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
Commands to start the scheduler and workers
UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm python -m distributed.cli.dask_scheduler --interface enp134s0f1 --protocol ucx
UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm dask-cuda-worker ucx://172.22.1.27:8786
Initializing the dask-client.
%env UCX_TLS=sockcm,cuda_copy,cuda_ipc
%env UCX_SOCKADDR_TLS_PRIORITY=sockcm
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
client = Client('ucx://172.22.1.27:8786')
client
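For reference, a sketch of the all-in-one alternative that starts the scheduler and workers in-process instead of connecting to an external scheduler; the protocol, interface, enable_nvlink, and enable_infiniband keyword arguments are assumptions about what the installed dask-cuda version exposes (they mirror the dask-cuda-worker command-line flags):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Assumed keyword arguments: dask-cuda builds UCX_TLS and
# UCX_SOCKADDR_TLS_PRIORITY for the workers from these flags,
# so the %env settings above would not be needed.
cluster = LocalCUDACluster(
    protocol="ucx",          # use UCX instead of TCP for comms
    interface="enp134s0f1",  # interface the scheduler listens on
    enable_nvlink=True,      # adds cuda_ipc for GPU-to-GPU transfers
    enable_infiniband=False, # would add IB transports when True
)
client = Client(cluster)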
Full error trace
distributed.core - INFO - Starting established connection
[1574100076.686698] [exp02:84896:0] select.c:438 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100076.686720] [exp02:84896:0] ucp_listener.c:122 UCX ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable
[1574100076.687523] [exp02:84896:0] select.c:438 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100076.687531] [exp02:84896:0] ucp_listener.c:122 UCX ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable
[1574100077.689709] [exp02:84896:0] select.c:438 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100077.689724] [exp02:84896:0] ucp_listener.c:122 UCX ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable
[1574100078.691843] [exp02:84896:0] select.c:438 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100078.691856] [exp02:84896:0] ucp_listener.c:122 UCX ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable
[1574100079.694047] [exp02:84896:0] select.c:438 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100079.694061] [exp02:84896:0] ucp_listener.c:122 UCX ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable
Follow-up question
What is the best way to choose the network interface to run UCX on?
For example, on a DGX-2 I have the following interfaces with IPv4 addresses, of which I guessed enp134s0f1 was the one I wanted; but there are also interfaces ib0 through ib7 (a sketch for enumerating these programmatically follows the list).
IP address for enp134s0f1: 172.22.1.27
IP address for ib0: 10.33.228.80
IP address for ib1: 10.33.228.81
IP address for ib2: 10.33.228.82
IP address for ib3: 10.33.228.83
IP address for ib4: 10.33.228.84
IP address for ib5: 10.33.228.85
IP address for ib6: 10.33.228.86
IP address for ib7: 10.33.228.87
IP address for docker0: 172.17.0.1
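As an aside, one way to reproduce the list above programmatically is with psutil, which distributed already depends on; this is only a sketch for enumerating candidates and says nothing about which NIC is closest to a given GPU:

import socket
import psutil

# Print every interface that has an IPv4 address, mirroring the
# "IP address for <name>: <addr>" list above.
for name, addrs in psutil.net_if_addrs().items():
    for addr in addrs:
        if addr.family == socket.AF_INET:
            print(f"IP address for {name}: {addr.address}")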
Environment Details:
Ucx Packages
ucx 1.7.0rc1 hd6f8bf8_2 conda-forge/label/rc_ucx
ucx-proc 1.0.0 gpu conda-forge
ucx-py 0.2 py37hd6f8bf8_2 conda-forge/label/rc_ucx
Dask packages
dask 2.8.0 py_1 conda-forge
dask-core 2.8.0 py_0 conda-forge
dask-cuda 0.11.0b191116 py37_2 rapidsai-nightly
dask-cudf 0.11.0a191116 py37_2844 rapidsai-nightly
About this issue
- State: closed
- Created 5 years ago
- Comments: 16 (16 by maintainers)
@jakirkham , This was fixed. I don’t see the issues now. Closing the issue.
There are no dumb questions @VibhuJawa , we’re all learning together here, and there’s no documentation for most of these things so there’s really no way other than asking. 😃
Interesting, it looks like mlx5_4 and mlx5_5 are not close to any of the GPUs. That definitely breaks the assumption made in a large part of the dask-cuda code, as per my previous comment. We have to find a way to identify the closest IB interfaces automatically, but currently we don't have a way to do that.
Regarding NVLink, what I noticed is that NVLink doesn't perform well when transferring small chunks of data (< 1 GB). Can you confirm what sizes are being transferred at the moment?
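Until that exists, a rough manual check is possible on Linux, where each IB device's NUMA node is exposed under sysfs; a sketch, assuming the standard sysfs layout (the mlx5_* device names are machine-specific):

import glob
import pathlib

# NUMA node of each InfiniBand device, e.g. mlx5_0 .. mlx5_9 on a DGX-2.
# A value of -1 means the kernel reports no NUMA affinity.
for dev in sorted(glob.glob("/sys/class/infiniband/*")):
    numa = pathlib.Path(dev, "device", "numa_node").read_text().strip()
    print(f"{pathlib.Path(dev).name}: NUMA node {numa}")

Cross-checking this against the GPU/NIC matrix from nvidia-smi topo -m shows which mlx5_* devices share a PCI switch or NUMA node with each GPU.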
Also, what is the error you see with dask-cuda-worker ... --enable-nvlink?