ucx-py: Failure with RMM (post CNMeM deprecation)
Have been seeing the following failures on CI and locally with the benchmarks since PR ( https://github.com/rapidsai/rmm/pull/466 ) was merged. A fair bit of history is documented in PR ( https://github.com/rapidsai/ucx-py/pull/575 ), a style PR where this first came up, though the failure can be reproduced outside of that PR.
Tried specifying a memory pool size close to the amount of memory on the GPU (leaving some space free for other miscellaneous usage), but that still fails. A pool size of 0 also fails, as does changing the chunk size. All fail with the same error.
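For reference, a rough sketch of the kind of pool setup being tried (the device index and the 90% fraction here are placeholders, not the exact values used):

```python
# Sketch (assumed values): configure an RMM pool sized close to the GPU's
# total memory and route CuPy allocations through it.
import cupy
import rmm

total_bytes = cupy.cuda.Device(0).mem_info[1]   # (free, total) -> total bytes

rmm.reinitialize(
    pool_allocator=True,
    # Leave some headroom for other usage; round down to a 256-byte multiple.
    initial_pool_size=(int(total_bytes * 0.9) // 256) * 256,
)

# Make CuPy allocate through RMM, as the benchmarks do.
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
```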
In the send/recv case, using different devices for the server and client avoids the issue. Using the CuPy memory pool instead of RMM's also works fine in send/recv.
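For comparison, the CuPy-pool workaround amounts to roughly the following (a sketch, not the benchmark's exact code):

```python
# Sketch of the workaround: let CuPy use its own memory pool instead of RMM.
import cupy

cupy.cuda.set_allocator(cupy.cuda.MemoryPool().malloc)
```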
At a bit of a loss for what is happening here, would be great to have another set of eyes 🙂
Dataframe merge benchmark failure:
python benchmarks/cudf-merge.py --chunks-per-dev 4 --chunk-size 10000
Process SpawnProcess-1:
Traceback (most recent call last):
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/ucp/utils.py", line 163, in _worker_process
ret = loop.run_until_complete(run())
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/ucp/utils.py", line 158, in run
return await func(rank, eps, args)
File "/datasets/jkirkham/devel/ucx-py/benchmarks/cudf-merge.py", line 169, in worker
df1 = generate_chunk(rank, args.chunk_size, args.n_chunks, "build", args.frac_match)
File "/datasets/jkirkham/devel/ucx-py/benchmarks/cudf-merge.py", line 114, in generate_chunk
"key": cupy.arange(start, stop=stop, dtype="int64"),
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/cupy/creation/ranges.py", line 55, in arange
ret = cupy.empty((size,), dtype=dtype)
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/cupy/creation/basic.py", line 22, in empty
return cupy.ndarray(shape, dtype, order=order)
File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/rmm/rmm.py", line 270, in rmm_cupy_allocator
buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
File "rmm/_lib/device_buffer.pyx", line 70, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
Send/Recv benchmark failure:
python benchmarks/local-send-recv.py -o rmm --server-dev 0 --client-dev 0 --reuse-alloc
Server Running at 10.33.225.165:48726
Client connecting to server at 10.33.225.165:48726
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<_listener_handler_coroutine() done, defined at /datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/ucp/core.py:144> exception=MemoryError('std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory')>
Traceback (most recent call last):
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/ucp/core.py", line 196, in _listener_handler_coroutine
await func(ep)
File "/datasets/jkirkham/devel/ucx-py/benchmarks/local-send-recv.py", line 68, in server_handler
t = np.zeros(args.n_bytes, dtype="u1")
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/cupy/creation/basic.py", line 204, in zeros
a = cupy.ndarray(shape, dtype, order=order)
File "cupy/core/core.pyx", line 134, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 544, in cupy.cuda.memory.alloc
File "/datasets/jkirkham/miniconda/envs/rapids15dev/lib/python3.8/site-packages/rmm/rmm.py", line 270, in rmm_cupy_allocator
buf = librmm.device_buffer.DeviceBuffer(size=nbytes)
File "rmm/_lib/device_buffer.pyx", line 70, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: ../include/rmm/mr/device/cuda_memory_resource.hpp:68: cudaErrorMemoryAllocation out of memory
About this issue
- State: closed
- Created 4 years ago
- Comments: 17 (17 by maintainers)
Actually, even if you only create 2 workers this is likely to be a problem, because there's a good chance that 2 processes both trying to allocate 1/2 of available GPU memory will not both succeed, due to per-process / per-context overhead.
I would use 1/(n+1) for n workers to be safe. Or a less conservative option would be (available_memory / n) - X, where X is some number of bytes, like 100MiB.
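A rough sketch of that sizing rule (the helper name and the 100MiB headroom are illustrative, not prescribed):

```python
# Illustrative helper implementing the per-worker sizing suggestion above.
import cupy
import rmm

def per_worker_pool_size(n_workers, headroom=100 * 2**20):
    """Split currently available device memory across workers, minus headroom."""
    free_bytes, _total_bytes = cupy.cuda.Device(0).mem_info
    size = free_bytes // n_workers - headroom
    return (size // 256) * 256  # keep the size 256-byte aligned to be safe

# Each worker process would then initialize its pool with something like:
rmm.reinitialize(pool_allocator=True,
                 initial_pool_size=per_worker_pool_size(n_workers=2))
```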