ucx-py: [BUG] Error while starting dask-client with ucx on dgx-2

I am trying to start a dask-client with UCX on a DGX-2, but I get the following error:

select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation

Commands to start the scheduler + workers:

UCX_MEMTYPE_CACHE=n UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm python -m distributed.cli.dask_scheduler --interface enp134s0f1  --protocol ucx

UCX_TLS=tcp,sockcm,cuda_copy,cuda_ipc UCX_SOCKADDR_TLS_PRIORITY=sockcm dask-cuda-worker ucx://172.22.1.27:8786

Initializing the dask-client:

%env UCX_TLS=sockcm,cuda_copy,cuda_ipc
%env UCX_SOCKADDR_TLS_PRIORITY=sockcm


from dask_cuda import LocalCUDACluster
from dask.distributed import Client

client = Client('ucx://172.22.1.27:8786')
client
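One thing worth double-checking (my guess, not something confirmed by the trace): the scheduler command above includes tcp in UCX_TLS, but the client environment does not, so the two sides may not share an active-message transport. A quick stdlib check that can be run in the scheduler, worker, and client processes to compare settings:

import os

# Print the UCX settings this process will actually use; the output
# should match across scheduler, workers, and client.
for key in sorted(os.environ):
    if key.startswith("UCX_"):
        print(f"{key}={os.environ[key]}")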

Full error trace:

distributed.core - INFO - Starting established connection
[1574100076.686698] [exp02:84896:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100076.686720] [exp02:84896:0]   ucp_listener.c:122  UCX  ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable
[1574100076.687523] [exp02:84896:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100076.687531] [exp02:84896:0]   ucp_listener.c:122  UCX  ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable
[1574100077.689709] [exp02:84896:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100077.689724] [exp02:84896:0]   ucp_listener.c:122  UCX  ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable
[1574100078.691843] [exp02:84896:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100078.691856] [exp02:84896:0]   ucp_listener.c:122  UCX  ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable
[1574100079.694047] [exp02:84896:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
[1574100079.694061] [exp02:84896:0]   ucp_listener.c:122  UCX  ERROR connection request failed on listener 0x55a98dc77950 with status Destination is unreachable

Follow-up question

What is the best way to choose the network interface to run UCX on?

For example, on a DGX-2 I have the following interfaces with IPv4 addresses, of which I guessed enp134s0f1 was the one I wanted, but there are also interfaces ib0-ib7 (a sketch for listing these programmatically follows the list).

  IP address for enp134s0f1: 172.22.1.27
  IP address for ib0:        10.33.228.80
  IP address for ib1:        10.33.228.81
  IP address for ib2:        10.33.228.82
  IP address for ib3:        10.33.228.83
  IP address for ib4:        10.33.228.84
  IP address for ib5:        10.33.228.85
  IP address for ib6:        10.33.228.86
  IP address for ib7:        10.33.228.87
  IP address for docker0:    172.17.0.1
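A minimal sketch for enumerating these programmatically (assuming psutil, which distributed already depends on):

import socket
import psutil

# Print every interface that has an IPv4 address, like the list above.
for name, addrs in psutil.net_if_addrs().items():
    for addr in addrs:
        if addr.family == socket.AF_INET:
            print(f"IP address for {name}: {addr.address}")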

Environment Details:

UCX packages

ucx                       1.7.0rc1             hd6f8bf8_2    conda-forge/label/rc_ucx
ucx-proc                  1.0.0                       gpu    conda-forge
ucx-py                    0.2              py37hd6f8bf8_2    conda-forge/label/rc_ucx

Dask packages

dask                      2.8.0                      py_1    conda-forge
dask-core                 2.8.0                      py_0    conda-forge
dask-cuda                 0.11.0b191116            py37_2    rapidsai-nightly
dask-cudf                 0.11.0a191116         py37_2844    rapidsai-nightly

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

@jakirkham, this was fixed. I don’t see the issue now, so I’m closing the issue.

There are no dumb questions @VibhuJawa , we’re all learning together here, and there’s no documentation for most of these things so there’s really no way other than asking. 😃

Interesting, it looks like mlx5_4 and mlx5_5 are not close to any of the GPUs. That definitely breaks the assumption made in a large part of the dask-cuda code, as per my previous comment. We have to find a way to identify the closest IB interfaces automatically, but currently we don’t have a way to do that.
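To illustrate the kind of detection that would be needed, here is a rough sketch (assuming pynvml and a Linux sysfs layout; this is a heuristic of mine, not what dask-cuda currently does) that pairs each GPU with the IB device sharing the longest PCIe path prefix:

import glob
import os
import pynvml

# PCIe sysfs paths of all InfiniBand devices (mlx5_0, mlx5_1, ...).
ib_paths = {
    os.path.basename(dev): os.path.realpath(os.path.join(dev, "device"))
    for dev in glob.glob("/sys/class/infiniband/*")
}

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    bus_id = pynvml.nvmlDeviceGetPciInfo(handle).busId
    if isinstance(bus_id, bytes):
        bus_id = bus_id.decode()
    # NVML reports an 8-digit PCI domain; sysfs uses 4 digits.
    bus_id = bus_id.lower()
    if len(bus_id.split(":")[0]) == 8:
        bus_id = bus_id[4:]
    gpu_path = os.path.realpath(f"/sys/bus/pci/devices/{bus_id}")
    # The IB device whose PCIe path shares the longest prefix with the
    # GPU's path is, roughly, the topologically closest one.
    closest = max(
        ib_paths,
        key=lambda ib: len(os.path.commonprefix([ib_paths[ib], gpu_path])),
    )
    print(f"GPU {i} ({bus_id}) -> {closest}")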

Regarding NVLink, what I noticed is that NVLink doesn’t perform well when transferring small chunks of data (< 1GB). Can you confirm what sizes are being transferred at the moment?
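As a rough way to check, this sketch (assuming a dask DataFrame named ddf; the name is a stand-in) estimates per-partition sizes using dask's sizeof estimate:

from dask.sizeof import sizeof

# Approximate size in bytes of each partition of `ddf` (hypothetical
# name); compare against the ~1GB threshold mentioned above.
sizes = ddf.map_partitions(sizeof).compute()
print(sizes)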

Also, what is the error you see with dask-cuda-worker ... --enable-nvlink?