ucx-py: Running Benchmark on DGX2 fails
I was recently granted access to a DGX2, ran the new benchmark @madsbk added, and got a new error (yay!):
ucp.exceptions.UCXError: User-defined limit was reached
Full traceback below:
(cudf_dev101) bzaitlen@exp02:/datasets/bzaitlen/GitRepos/ucx-py$ python benchmarks/local-send-recv.py -o cupy -n "100MB" --server-dev 1 --client-dev 2 --reuse-alloc --server-address 172.22.1.27
[1574370360.485816] [exp02:63952:0] ucp_context.c:1004 UCX ERROR exceeded transports/devices limit (71 requested, up to 64 are supported)
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/datasets/bzaitlen/miniconda3/envs/cudf_dev101/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/datasets/bzaitlen/miniconda3/envs/cudf_dev101/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/datasets/bzaitlen/GitRepos/ucx-py/benchmarks/local-send-recv.py", line 18, in server
    ucp.init()
  File "/datasets/bzaitlen/GitRepos/ucx-py/ucp/public_api.py", line 74, in init
    options, blocking_progress_mode=blocking_progress_mode
  File "ucp/_libs/core.pyx", line 358, in ucp._libs.core.ApplicationContext.__cinit__
    assert_ucs_status(status)
  File "ucp/_libs/core.pyx", line 30, in ucp._libs.core.assert_ucs_status
    raise UCXError(msg)
ucp.exceptions.UCXError: User-defined limit was reached
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (15 by maintainers)
The issue here doesn’t occur anymore with UCX 1.11:
Closing.
A workaround is to set the transports (UCX_TLS) manually, e.g.:
UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm
103.25 GB/s isn’t bad 😃
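For anyone scripting the UCX_TLS workaround, here is a minimal sketch in Python. It only builds an environment with `UCX_TLS` set and the benchmark command from the original report; the helper names `env_with_ucx_tls` and `benchmark_command` are made up for illustration and are not part of ucx-py. It assumes the variable has to be in the environment before `ucp.init()` creates the UCX context.

```python
import os

# Transport list copied from the workaround in this thread.
WORKAROUND_TLS = "tcp,cuda_copy,cuda_ipc,sockcm"


def env_with_ucx_tls(tls=WORKAROUND_TLS, base_env=None):
    """Return a copy of the environment with UCX_TLS set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_TLS"] = tls
    return env


def benchmark_command():
    """Benchmark invocation from the original report (hypothetical helper)."""
    return [
        "python", "benchmarks/local-send-recv.py",
        "-o", "cupy",
        "-n", "100MB",
        "--server-dev", "1",
        "--client-dev", "2",
        "--reuse-alloc",
        "--server-address", "172.22.1.27",
    ]


if __name__ == "__main__":
    env = env_with_ucx_tls()
    # Print rather than execute, so the sketch runs without UCX installed.
    print("UCX_TLS=" + env["UCX_TLS"], " ".join(benchmark_command()))
```

To actually launch the benchmark, the two pieces could be passed to `subprocess.run(benchmark_command(), env=env_with_ucx_tls())` from the ucx-py checkout.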
And with a message size of 10GB, I get: