dask-cuda: Shuffle benchmark fails while getting transfer logs w/ ucx 1.11
The shuffle benchmark fails with UCX 1.11 in the NVLink + no-IB configuration. Judging from the error message, the failure originates while collecting the worker transfer logs rather than in the benchmark itself.
Error stacktrace:
Traceback (most recent call last):
File "local_cudf_shuffle.py", line 247, in <module>
main(parse_args())
File "local_cudf_shuffle.py", line 125, in main
incoming_logs = client.run(lambda dask_worker: dask_worker.incoming_transfer_log)
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/client.py", line 2543, in run
return self.sync(self._run, function, *args, **kwargs)
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/client.py", line 859, in sync
return sync(
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
result[0] = yield future
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/client.py", line 2463, in _run
responses = await self.scheduler.broadcast(
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/core.py", line 866, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/core.py", line 667, in send_recv
raise exc.with_traceback(tb)
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/core.py", line 502, in handle_comm
result = await result
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/scheduler.py", line 5587, in broadcast
results = await All(
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/utils.py", line 210, in All
result = await tasks.next()
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/scheduler.py", line 5582, in send_message
resp = await send_recv(comm, close=True, serializers=serializers, **msg)
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/core.py", line 651, in send_recv
response = await comm.read(deserializers=deserializers)
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/distributed/comm/ucx.py", line 300, in read
await self.ep.recv(each_frame)
File "/opt/conda/envs/cugraph_bench/lib/python3.8/site-packages/ucp/core.py", line 704, in recv
ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp.exceptions.UCXError: <[Recv #077] ep: 0x7f69b0d1bb40, tag: 0x34fc41eda976c804, nbytes: 84165, type: <class 'numpy.ndarray'>>: Connection reset by remote peer
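For context, the benchmark runs themselves complete; the failure only happens afterwards, when the script collects the per-worker transfer logs over the UCX comms. A minimal sketch of that call (assuming an already-running UCX cluster; the scheduler address is a placeholder):

from distributed import Client

# Placeholder address standing in for the ucx://... scheduler used below.
client = Client("ucx://<scheduler-address>")

# Same call as local_cudf_shuffle.py line 125 in the traceback: fetch each
# worker's incoming transfer log via a scheduler broadcast to all workers.
incoming_logs = client.run(lambda dask_worker: dask_worker.incoming_transfer_log)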
Command to run the benchmark:
UCX_MAX_RNDV_RAILS=1 python local_cudf_shuffle.py -t gpu -p ucx --enable-tcp-over-ucx --enable-nvlink --disable-infiniband --disable-rdmacm --runs 20 --in-parts 16 --partition-size 1GB --disable-rmm-pool --no-silence-logs --scheduler-address ucx://...
Note: this does not happen in the UCX + IB configuration.
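For anyone reproducing this programmatically, a rough equivalent of the CLI flags above when building the cluster in Python (a sketch assuming dask_cuda's LocalCUDACluster; keyword names mirror the benchmark flags and should be checked against the installed dask-cuda version):

import os
from dask_cuda import LocalCUDACluster
from distributed import Client

# UCX_MAX_RNDV_RAILS=1 must be set before UCX is initialized.
os.environ["UCX_MAX_RNDV_RAILS"] = "1"

cluster = LocalCUDACluster(
    protocol="ucx",
    enable_tcp_over_ucx=True,   # --enable-tcp-over-ucx
    enable_nvlink=True,         # --enable-nvlink
    enable_infiniband=False,    # --disable-infiniband
    enable_rdmacm=False,        # --disable-rdmacm
    rmm_pool_size=None,         # --disable-rmm-pool
)
client = Client(cluster)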
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (15 by maintainers)
Hey @pentschev! I was able to test this with ucx branch 1.11.x and can confirm the error is fixed.
Thanks for the ping! I’ll test on my end and report back.
@ayushdg thanks for confirming. I’ve been able to test a proper UCX fix for that issue and it seems to resolve it. The fix is now being worked into upstream UCX and will then be backported to 1.11; I’ll let you know when that happens so you can verify it on your end as well. I hope it will be ready early next week, and it would be great if you could still test it next week, so we can make sure no other issues remain before the 1.11 release scheduled for August.
Happy to test this workaround at larger scales and report the outcome!
Thanks @ayushdg for the details. The reason I was getting an OOM was that I forgot to specify that the benchmark should use all the devices. I can now reproduce it consistently, and without needing to spin up the cluster separately, with the following command:
I’m reasonably sure the cause of this is indeed https://github.com/openucx/ucx/issues/6922. I was able to write a functional workaround: it simply adds an await asyncio.sleep(0.1) just before the ep.abort() call into UCX-Py. Would you be able to test a branch with those changes in https://github.com/pentschev/distributed/tree/workaround-ucx-connection-reset-by-peer ? I can’t guarantee this is a fully functional workaround, as I imagine sleep(0.1) may not be enough if the other endpoint is still receiving large buffers when UCX.close() is called.
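The branch itself isn't reproduced here, but the shape of the workaround is roughly the following (a sketch only, not distributed's actual UCX comm code; the function name is illustrative):

import asyncio

async def close_ucx_endpoint(ep):
    # Workaround sketch: give the remote side a moment to finish draining
    # in-flight receives before aborting the endpoint (see openucx/ucx#6922).
    # 0.1s is a heuristic and may not be enough for large buffers.
    await asyncio.sleep(0.1)
    ep.abort()  # UCX-Py Endpoint.abort()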