ucx-py: Running Benchmark on DGX2 fails

I was recently granted access to a DGX-2, ran the new benchmark @madsbk recently added, and hit a new error (yay!):

ucp.exceptions.UCXError: User-defined limit was reached

Full traceback below:

(cudf_dev101) bzaitlen@exp02:/datasets/bzaitlen/GitRepos/ucx-py$ python benchmarks/local-send-recv.py -o cupy  -n "100MB" --server-dev 1 --client-dev 2 --reuse-alloc --server-address 172.22.1.27
[1574370360.485816] [exp02:63952:0]    ucp_context.c:1004 UCX  ERROR exceeded transports/devices limit (71 requested, up to 64 are supported)
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/datasets/bzaitlen/miniconda3/envs/cudf_dev101/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/datasets/bzaitlen/miniconda3/envs/cudf_dev101/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/datasets/bzaitlen/GitRepos/ucx-py/benchmarks/local-send-recv.py", line 18, in server
    ucp.init()
  File "/datasets/bzaitlen/GitRepos/ucx-py/ucp/public_api.py", line 74, in init
    options, blocking_progress_mode=blocking_progress_mode
  File "ucp/_libs/core.pyx", line 358, in ucp._libs.core.ApplicationContext.__cinit__
    assert_ucs_status(status)
  File "ucp/_libs/core.pyx", line 30, in ucp._libs.core.assert_ucs_status
    raise UCXError(msg)
ucp.exceptions.UCXError: User-defined limit was reached
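The underlying UCX error (`exceeded transports/devices limit (71 requested, up to 64 are supported)`) means UCX discovered more transport/device pairs on the DGX-2 than its hard limit of 64. A quick way to see how many pairs a machine exposes is to count the `Transport:` lines in `ucx_info -d` output. A minimal sketch (assuming the usual `ucx_info -d` line format; the `sample` string here is illustrative, and on a real system you would feed in the actual command output, e.g. via `subprocess`):

```python
import re

# Illustrative excerpt mimicking `ucx_info -d` output; a DGX-2 with all
# NICs and GPUs visible reports far more entries than this (71 in the
# traceback above, over the 64-entry limit).
sample = """\
#      Transport: self
#      Transport: tcp
#      Transport: cuda_copy
#      Transport: cuda_ipc
"""

def count_transports(text):
    """Count transport entries in `ucx_info -d`-style output."""
    return len(re.findall(r"^#\s+Transport:", text, flags=re.M))

print(count_transports(sample))  # 4 for this sample
```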

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

The issue here doesn’t occur anymore with UCX 1.11:

$ python benchmarks/send-recv.py -o cupy  -n "100MB" --server-dev 1 --client-dev 2 --reuse-alloc
Server Running at 10.33.228.80:34024
Client connecting to server at 10.33.228.80:34024
Roundtrip benchmark
--------------------------
n_iter          | 10
n_bytes         | 95.37 MiB
object          | cupy
reuse alloc     | True
transfer API    | TAG
UCX_TLS         | all
UCX_NET_DEVICES | all
==========================
Device(s)       | 1, 2
Average         | 34.95 GiB/s
Median          | 89.54 GiB/s
--------------------------
Iterations
--------------------------
000         | 5.40 GiB/s
001         |81.78 GiB/s
002         |88.06 GiB/s
003         |86.26 GiB/s
004         |89.73 GiB/s
005         |89.35 GiB/s
006         |91.52 GiB/s
007         |94.97 GiB/s
008         |90.95 GiB/s
009         |92.04 GiB/s

Closing.

A workaround is to set the transports manually, e.g.: UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm.
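Because UCX reads its configuration when the context is created, the restriction has to be in place before `ucp.init()` runs. A minimal sketch of doing this from Python via the environment (the `set_ucx_tls` helper is hypothetical, not part of ucx-py):

```python
import os

def set_ucx_tls(transports):
    """Restrict UCX to an explicit transport list by setting UCX_TLS.

    Must be called before ucp.init(), since UCX reads the environment
    when the context is created.
    """
    os.environ["UCX_TLS"] = ",".join(transports)
    return os.environ["UCX_TLS"]

tls = set_ucx_tls(["tcp", "cuda_copy", "cuda_ipc", "sockcm"])
print(tls)  # tcp,cuda_copy,cuda_ipc,sockcm
# import ucp; ucp.init()  # would now see only the listed transports
```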

103.25 GB/s isn’t bad 😃

$ UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm UCXPY_IFNAME=enp134s0f1  python local-send-recv.py  -n "100MB" --server-dev 1 --client-dev 2 --object_type=cupy --reuse-alloc

Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 100.00 MB
object      | cupy
reuse alloc | True
==========================
Device(s)   | 1, 2
Average     | 73.70 GB/s
--------------------------
Iterations
--------------------------
000         | 21.67 GB/s
001         | 84.72 GB/s
002         |103.25 GB/s
003         |106.59 GB/s
004         |102.35 GB/s

And with a message size of 10GB, I get:

--------------------------
n_iter      | 5
n_bytes     | 10.00 GB
object      | cupy
reuse alloc | True
==========================
Device(s)   | 1, 2
Average     | 140.61 GB/s
--------------------------
Iterations
--------------------------
000         |131.20 GB/s
001         |142.95 GB/s
002         |143.00 GB/s
003         |143.29 GB/s
004         |143.45 GB/s