dask-cuda: Spill over after libcudf++ merge is causing CUDA_ERROR_OUT_OF_MEMORY issues


After the libcudf++ merge, the spill-over mechanism might be failing.

The current hypothesis is that in dask-cuda, when data that has been spilled to disk is moved back to the GPU, it is allocated via Numba instead of RMM.

Relevant code lines are:

From dask-cuda:

https://github.com/rapidsai/dask-cuda/blob/db07453b130e7fea082279f8bc234f6227718b5b/dask_cuda/device_host_file.py#L90

From distributed (an example of how it should be handled):

https://github.com/dask/distributed/blob/4a8a4f3bce378406e83a099e8a12fc9bc12ef25c/distributed/comm/ucx.py#L45-L63
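
For reference, the distributed UCX comm takes the approach of preferring an RMM-backed device buffer and only falling back to Numba when RMM is unavailable. Below is a minimal sketch of that pattern, illustrative only; host_to_device_frame is a hypothetical name, not the actual dask-cuda API:

try:
    import rmm

    def host_to_device_frame(frame):
        # Copy host bytes into an RMM-managed DeviceBuffer, so the
        # allocation goes through RMM and any configured memory pool.
        return rmm.DeviceBuffer.to_device(frame)

except ImportError:
    from numba import cuda

    def host_to_device_frame(frame):
        # Fallback: Numba calls cuMemAlloc directly, bypassing the RMM
        # pool; this is the path suspected of causing the OOM above.
        return cuda.to_device(frame)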

CC: @jakirkham @pentschev @kkraus14.

Code to recreate the issue:

https://gist.github.com/VibhuJawa/dbf2573954db86fb193b687022a20f46

Note: I have not re-run the cleaned-up code on exp01, but the issue should still be there (exp01 was busy).
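
For reference, here is a minimal, hypothetical sketch of the kind of workload that exercises this path (the gist above is the actual reproducer; the memory limit and data sizes below are illustrative):

import numpy as np
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    # A small device_memory_limit forces dask-cuda to spill device
    # buffers to host/disk and later move them back onto the GPU.
    cluster = LocalCUDACluster(device_memory_limit="1GB")
    client = Client(cluster)

    n = 50_000_000
    df = cudf.DataFrame(
        {"key": np.random.randint(0, n, n), "val": np.random.random(n)}
    )
    left = dask_cudf.from_cudf(df, npartitions=16)
    right = dask_cudf.from_cudf(df, npartitions=16)

    # The merge shuffle creates intermediates that exceed the limit,
    # so results are spilled out and read back through host_to_device.
    print(len(left.merge(right, on="key")))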

Stack Trace

ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
distributed.worker - ERROR - [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 744, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/distributed/worker.py", line 2455, in execute
    data[k] = self.data[k]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 152, in __getitem__
    return self.device_buffer[key]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 70, in __getitem__
    return self.slow_to_fast(key)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 57, in slow_to_fast
    value = self.slow[key]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/func.py", line 39, in __getitem__
    return self.load(self.d[key])
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in host_to_device
    frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in <listcomp>
    frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 225, in _require_cuda_context
    return fn(*args, **kws)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/api.py", line 111, in to_device
    to, new = devicearray.auto_device(obj, stream=stream, copy=copy)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 704, in auto_device
    devobj = from_array_like(obj, stream=stream)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 642, in from_array_like
    writeback=ary, stream=stream, gpu_data=gpu_data)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 761, in memalloc
    self._attempt_allocation(allocator)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 751, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb3dd96c410>>, <Task finished coro=<Worker.execute() done, defined at /raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/distributed/worker.py:2438> exception=CudaAPIError(2, 'Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY')>)
Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 744, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/distributed/worker.py", line 2455, in execute
    data[k] = self.data[k]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 152, in __getitem__
    return self.device_buffer[key]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 70, in __getitem__
    return self.slow_to_fast(key)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 57, in slow_to_fast
    value = self.slow[key]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/func.py", line 39, in __getitem__
    return self.load(self.d[key])
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in host_to_device
    frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in <listcomp>
    frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 225, in _require_cuda_context
    return fn(*args, **kws)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/api.py", line 111, in to_device
    to, new = devicearray.auto_device(obj, stream=stream, copy=copy)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 704, in auto_device
    devobj = from_array_like(obj, stream=stream)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 642, in from_array_like
    writeback=ary, stream=stream, gpu_data=gpu_data)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 761, in memalloc
    self._attempt_allocation(allocator)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 751, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
distributed.worker - ERROR - [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 744, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

Thanks a lot @VibhuJawa for testing this. I'll make sure this is merged for 0.12 and will leave this issue open until we merge it there.

Did you also test without PR (#227) using the same cuDF versions? Just curious if something in cuDF also affected it.

Yup, tested on the same versions as above.

Just for clarification, @VibhuJawa, does that mean it did not work in pure cuDF with the same version (i.e., this PR definitively caused the fix)?

Yup, I believe so.

I tested it in the same environment by just doing a source install of dask-cuda (branch 277).

That is, it works on the environment below:

# packages in environment at /raid/vjawa/conda_install/conda_env/envs/cudf_12_16_jan:
cudf                      0.12.0b200116         py37_1452    rapidsai-nightly
dask-cudf                 0.12.0b200116         py37_1452    rapidsai-nightly
libcudf                   0.12.0b200116     cuda10.1_1422    rapidsai-nightly
dask-cuda                 0.6.0.dev0+191.g59e1f14          pypi_0    pypi ### source install on this branch

And it fails on the environment below:

# packages in environment at /raid/vjawa/conda_install/conda_env/envs/cudf_12_16_jan:
cudf                      0.12.0b200116         py37_1452    rapidsai-nightly
dask-cudf                 0.12.0b200116         py37_1452    rapidsai-nightly
libcudf                   0.12.0b200116     cuda10.1_1422    rapidsai-nightly
dask-cuda                 0.12.0a200117           py37_47    rapidsai-nightly
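
To double-check which dask-cuda build an environment actually imports, something like the following can be used (this assumes dask_cuda exposes __version__):

import dask_cuda

# The version string distinguishes the source install (e.g. a .dev0+... tag)
# from the conda nightly, and __file__ shows which checkout is being imported.
print(dask_cuda.__version__)
print(dask_cuda.__file__)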

@jakirkham, yup, the issue no longer seems to be present, as the workflow works now. Thanks for closing.

This patch should be in the latest nightlies. @VibhuJawa, would you be able to try them and let us know if they are working?

@jakirkham, sure, will update here once I get the time.


@pentschev, I tested #227 and it now works successfully. Thanks a lot for your work on this, and sorry for the delay in testing.

Tested on the cuDF versions below (for record keeping):

cudf                      0.12.0b200116         py37_1452    rapidsai-nightly
dask-cudf                 0.12.0b200116         py37_1452    rapidsai-nightly
libcudf                   0.12.0b200116     cuda10.1_1422    rapidsai-nightly