dask-cuda: Shuffle benchmark fails while getting transfer logs w/ ucx 1.11

The shuffle benchmark seems to be failing with UCX 1.11 in the NVLink + no-IB configuration. From the error message, the failure originates while collecting the worker transfer logs rather than in the benchmark itself.
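
For reference, the call that fails is the transfer-log collection at local_cudf_shuffle.py line 125 in the traceback below, not the shuffle itself. A minimal sketch of that step, assuming a Distributed Client already connected to the UCX cluster (the scheduler address here is a placeholder):

    from distributed import Client

    # Placeholder address; in the benchmark this comes from --scheduler-address.
    client = Client("ucx://<scheduler-address>")

    # Ask every worker for its incoming transfer log once the shuffle runs
    # finish; this is the client.run() call that raises the UCXError below.
    incoming_logs = client.run(lambda dask_worker: dask_worker.incoming_transfer_log)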

Error stacktrace:

Traceback (most recent call last):
  File "local_cudf_shuffle.py", line 247, in <module>
    main(parse_args())
  File "local_cudf_shuffle.py", line 125, in main
    incoming_logs = client.run(lambda dask_worker: dask_worker.incoming_transfer_log)
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/client.py", line 2543, in run
    return self.sync(self._run, function, *args, **kwargs)
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/client.py", line 859, in sync
    return sync(
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
    result[0] = yield future
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/client.py", line 2463, in _run
    responses = await self.scheduler.broadcast(
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/core.py", line 866, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/core.py", line 667, in send_recv
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/core.py", line 502, in handle_comm
    result = await result
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/scheduler.py", line 5587, in broadcast
    results = await All(
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/utils.py", line 210, in All
    result = await tasks.next()
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/scheduler.py", line 5582, in send_message
    resp = await send_recv(comm, close=True, serializers=serializers, **msg)
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/core.py", line 651, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/comm/ucx.py", line 300, in read
    await self.ep.recv(each_frame)
  File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/ucp/core.py", line 704, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp.exceptions.UCXError: <[Recv #077] ep: 0x7f69b0d1bb40, tag: 0x34fc41eda976c804, nbytes: 84165, type: <class 'numpy.ndarray'>>: Connection reset by remote peer

Command to run the benchmark:

 UCX_MAX_RNDV_RAILS=1 python local_cudf_shuffle.py -t gpu -p ucx --enable-tcp-over-ucx --enable-nvlink --disable-infiniband --disable-rdmacm --runs 20 --in-parts 16 --partition-size 1GB --disable-rmm-pool --no-silence-logs --scheduler-address ucx://...

Note: This doesn’t happen in the UCX + IB configuration.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Hey @pentschev! I was able to test this with the ucx 1.11.x branch and can confirm the error is fixed.

Thanks for the ping! I’ll test on my end and report back.

@ayushdg thanks for confirming. I’ve been able to test a proper UCX fix for that issue and it seems to resolve it. It’s now being worked on to get into upstream UCX and then backported to 1.11. I’ll let you know once that’s done so you can verify it on your end as well. I hope it will be ready early next week, and it would be great to have you test it that same week so we can make sure no other issues remain before the 1.11 release scheduled for August.

Happy to test this workaround at larger scales and report the outcome!

Thanks @ayushdg for the details. The reason I was getting OOM was that I forgot to specify that I wanted the benchmark to use all the devices. I can reproduce it consistently, and without needing to spin up the cluster separately, with the following command:

$ UCX_MAX_RNDV_RAILS=1 python local_cudf_shuffle.py -t gpu -p ucx --enable-tcp-over-ucx --enable-nvlink --disable-infiniband --disable-rdmacm --runs 20 --in-parts 16 --partition-size 1GB --rmm-pool-size 22G --no-silence-logs -d 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

I’m reasonably sure the cause of this is indeed https://github.com/openucx/ucx/issues/6922. I was able to write a functional workaround: it simply adds an await asyncio.sleep(0.1) just before the ep.abort() call from UCX-Py. Would you be able to test a branch with those changes in https://github.com/pentschev/distributed/tree/workaround-ucx-connection-reset-by-peer ? I can’t guarantee this is a fully functional workaround, as I imagine sleep(0.1) may not be enough if the other endpoint is still receiving large buffers when UCX.close() is called.
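
For context, a minimal sketch of what that workaround might look like in the endpoint-close path of distributed/comm/ucx.py; the attribute name self._ep and the surrounding structure are illustrative assumptions, see the linked branch for the actual change:

    import asyncio

    async def close(self):
        if self._ep is not None:
            # Workaround for https://github.com/openucx/ucx/issues/6922: give the
            # remote peer a moment to finish in-flight receives before aborting
            # the endpoint, otherwise it may see "Connection reset by remote peer".
            await asyncio.sleep(0.1)  # 0.1s may not suffice for large buffers
            self._ep.abort()  # the UCX-Py endpoint abort mentioned above
            self._ep = None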