ucx-py: Failure with RMM (post CNMeM deprecation)

I have been seeing the following failures on CI and locally with the benchmarks since PR ( https://github.com/rapidsai/rmm/pull/466 ) was merged. A fair bit of history is documented in PR ( https://github.com/rapidsai/ucx-py/pull/575 ), a style PR where this first came up, though it can be reproduced outside of that PR.

I tried specifying the memory pool size to be close to the amount of memory on the GPU (with some space still free for other miscellaneous usage), but that still fails. A pool size of 0 also fails, as does changing the chunk size. All fail with the same error.
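
For reference, a minimal sketch of the kind of configuration that was tried (the pool sizes here are illustrative, not the exact values used; it assumes an RMM build exposing rmm.reinitialize):

import cupy
import rmm

GiB = 1 << 30

# Pool sized close to the device's total memory, leaving some headroom
# (30 GiB here is just an example figure) ...
rmm.reinitialize(pool_allocator=True, initial_pool_size=30 * GiB)

# ... or a zero-sized starting pool; both fail with the same error.
# rmm.reinitialize(pool_allocator=True, initial_pool_size=0)

# Route CuPy allocations through RMM, as the benchmarks do.
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)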

In the send/recv case, using different devices for the server and client avoids the issue. Using the CuPy memory pool instead of RMM’s also works fine for send/recv.
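
The workarounds above amount to roughly the following (again just a sketch; the --server-dev / --client-dev flags are the ones shown in the reproducer below):

import cupy

# Workaround A: use CuPy's own memory pool instead of RMM's
# (send/recv works fine with this).
cupy.cuda.set_allocator(cupy.cuda.MemoryPool().malloc)

# Workaround B: keep RMM, but put the server and client on different GPUs,
# e.g. --server-dev 0 --client-dev 1 for local-send-recv.py.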

I'm at a bit of a loss as to what is happening here; another set of eyes would be great 🙂

Dataframe merge benchmark failure:
python benchmarks/cudf-merge.py --chunks-per-dev 4 --chunk-size 10000
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/ucp/utils.py", line 163, in _worker_process
    ret = loop.run_until_complete(run())
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/ucp/utils.py", line 158, in run
    return await func(rank, eps, args)
  File "/datasets/jkirkham/devel/ucx-py/benchmarks/cudf-merge.py", line 169, in worker
    df1 = generate_chunk(rank, args.chunk_size, args.n_chunks, "build", args.frac_match)
  File "/datasets/jkirkham/devel/ucx-py/benchmarks/cudf-merge.py", line 114, in generate_chunk
    "key": cupy.arange(start, stop=stop, dtype="int64"),
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/cupy/creation/ranges.py", line 55, in arange
    ret = cupy.empty((size,), dtype=dtype)
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/cupy/creation/basic.py", line 22, in empty
    return cupy.ndarray(shape, dtype, order=order)
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/rmm/rmm.py", line 270, in rmm_cupy_allocator
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
  File "rmm/_lib/device_buffer.pyx", line 70, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp68: cudaErrorMemoryAllocation out of memory

Send/Recv benchmark failure:
python benchmarks/local-send-recv.py -o rmm --server-dev 0 --client-dev 0 --reuse-alloc
Server Running at 10.33.225.165:48726
Client connecting to server at 10.33.225.165:48726
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<_listener_handler_coroutine() done, defined at /datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/ucp/core.py:144> exception=MemoryError('std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp68: cudaErrorMemoryAllocation out of memory')>
Traceback (most recent call last):
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/ucp/core.py", line 196, in _listener_handler_coroutine
    await func(ep)
  File "/datasets/jkirkham/devel/ucx-py/benchmarks/local-send-recv.py", line 68, in server_handler
    t = np.zeros(args.n_bytes, dtype="u1")
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/cupy/creation/basic.py", line 204, in zeros
    a = cupy.ndarray(shape, dtype, order=order)
  File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
  File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/rmm/rmm.py", line 270, in rmm_cupy_allocator
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
  File "rmm/_lib/device_buffer.pyx", line 70, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp68: cudaErrorMemoryAllocation out of memory

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

As a result, if we create more than 2 workers, they can try to allocate more memory than is available on the device.

Actually, even if you only create 2 workers this is likely to be a problem, because there’s a good chance that 2 processes both trying to allocate 1/2 of the available GPU memory will not both succeed, due to per-process / per-context overhead.

I would use 1/(n+1) of the available memory for n workers to be safe. A less conservative option would be (available_memory / n) - X, where X is some number of bytes, like 100 MiB. See the sketch below.
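
A minimal sketch of that sizing rule (the helper name is made up; it assumes CuPy is available to query free device memory):

import cupy

def per_worker_pool_size(n_workers, reserve_bytes=100 * 2**20, conservative=True):
    # Free/total device memory in bytes for the current device.
    free_bytes, total_bytes = cupy.cuda.runtime.memGetInfo()
    if conservative:
        # 1/(n+1) of free memory per worker.
        return free_bytes // (n_workers + 1)
    # Less conservative: split free memory evenly and subtract a fixed
    # reserve (e.g. 100 MiB) for per-process / per-context overhead.
    return free_bytes // n_workers - reserve_bytes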