dask-cuda: Spill over after libcudf++ merge is causing CUDA_ERROR_OUT_OF_MEMORY issues


After the libcudf++ merge, the spill-over mechanism might be failing.

The current hypothesis is that in dask-cuda, when data that has been spilled to disk is moved back to the GPU, it is allocated via Numba instead of RMM.

Relevant code lines are:

From dask-cuda:

https://github.com/rapidsai/dask-cuda/blob/db07453b130e7fea082279f8bc234f6227718b5b/dask_cuda/device_host_file.py#L90

From distributed (an example of how it should be handled):

https://github.com/dask/distributed/blob/4a8a4f3bce378406e83a099e8a12fc9bc12ef25c/distributed/comm/ucx.py#L45-L63
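
For reference, the distributed UCX comm takes the approach of preferring an RMM-backed device buffer and only falling back to Numba when RMM is unavailable. Below is a minimal sketch of that pattern, illustrative only; host_to_device_frame is a hypothetical name, not the actual dask-cuda API:

try:
    import rmm

    def host_to_device_frame(frame):
        # Copy host bytes into an RMM-managed DeviceBuffer, so the
        # allocation goes through RMM and any configured memory pool.
        return rmm.DeviceBuffer.to_device(frame)

except ImportError:
    from numba import cuda

    def host_to_device_frame(frame):
        # Fallback: Numba calls cuMemAlloc directly, bypassing the RMM
        # pool; this is the path suspected of causing the OOM above.
        return cuda.to_device(frame)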

CC: @jakirkham @pentschev @kkraus14.

Code to recreate the issue:

https://gist.github.com/VibhuJawa/dbf2573954db86fb193b687022a20f46

Note: I have not re-run the cleaned-up code on exp01, but the issue should still be there (exp01 was busy).
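
For reference, here is a minimal, hypothetical sketch of the kind of workload that exercises this path (the gist above is the actual reproducer; the memory limit and data sizes below are illustrative):

import numpy as np
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    # A small device_memory_limit forces dask-cuda to spill device
    # buffers to host/disk and later move them back onto the GPU.
    cluster = LocalCUDACluster(device_memory_limit="1GB")
    client = Client(cluster)

    n = 50_000_000
    df = cudf.DataFrame(
        {"key": np.random.randint(0, n, n), "val": np.random.random(n)}
    )
    left = dask_cudf.from_cudf(df, npartitions=16)
    right = dask_cudf.from_cudf(df, npartitions=16)

    # The merge shuffle creates intermediates that exceed the limit,
    # so results are spilled out and read back through host_to_device.
    print(len(left.merge(right, on="key")))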

Stack Trace

ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
distributed.worker - ERROR - [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 744, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/distributed/worker.py", line 2455, in execute
    data[k] = self.data[k]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 152, in __getitem__
    return self.device_buffer[key]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 70, in __getitem__
    return self.slow_to_fast(key)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 57, in slow_to_fast
    value = self.slow[key]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/func.py", line 39, in __getitem__
    return self.load(self.d[key])
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in host_to_device
    frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in <listcomp>
    frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 225, in _require_cuda_context
    return fn(*args, **kws)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/api.py", line 111, in to_device
    to, new = devicearray.auto_device(obj, stream=stream, copy=copy)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 704, in auto_device
    devobj = from_array_like(obj, stream=stream)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 642, in from_array_like
    writeback=ary, stream=stream, gpu_data=gpu_data)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 761, in memalloc
    self._attempt_allocation(allocator)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 751, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb3dd96c410>>, <Task finished coro=<Worker.execute() done, defined at /raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/distributed/worker.py:2438> exception=CudaAPIError(2, 'Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY')>)
Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 744, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/distributed/worker.py", line 2455, in execute
    data[k] = self.data[k]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 152, in __getitem__
    return self.device_buffer[key]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 70, in __getitem__
    return self.slow_to_fast(key)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/buffer.py", line 57, in slow_to_fast
    value = self.slow[key]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/zict/func.py", line 39, in __getitem__
    return self.load(self.d[key])
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in host_to_device
    frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 90, in <listcomp>
    frames = [cuda.to_device(f) if ic else f for ic, f in zip(s.is_cuda, s.parts)]
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 225, in _require_cuda_context
    return fn(*args, **kws)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/api.py", line 111, in to_device
    to, new = devicearray.auto_device(obj, stream=stream, copy=copy)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 704, in auto_device
    devobj = from_array_like(obj, stream=stream)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 642, in from_array_like
    writeback=ary, stream=stream, gpu_data=gpu_data)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/devicearray.py", line 103, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 761, in memalloc
    self._attempt_allocation(allocator)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 751, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
ERROR Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
distributed.worker - ERROR - [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY
Traceback (most recent call last):
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 744, in _attempt_allocation
    allocator()
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 759, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 294, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_12_8_jan/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 329, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

Thanks a lot @VibhuJawa for testing this. I'll make sure this is merged for 0.12 and will leave this issue open until we merge it there.

Did you also test without PR (#227) using the same cuDF versions? Just curious if something in cuDF also affected it.

Yup, tested on the same versions as above.

Just for clarification, @VibhuJawa, does that mean it did not work in pure cuDF with the same version (i.e., this PR definitively caused the fix)?

Yup, I believe so.

I tested it in the same environment by just doing a source install of dask-cuda (branch 277).

That is, it works on the environment below:

# packages in environment at /raid/vjawa/conda_install/conda_env/envs/cudf_12_16_jan:
cudf                      0.12.0b200116         py37_1452    rapidsai-nightly
dask-cudf                 0.12.0b200116         py37_1452    rapidsai-nightly
libcudf                   0.12.0b200116     cuda10.1_1422    rapidsai-nightly
dask-cuda                 0.6.0.dev0+191.g59e1f14          pypi_0    pypi ### source install on this branch

And it fails on the environment below:

# packages in environment at /raid/vjawa/conda_install/conda_env/envs/cudf_12_16_jan:
cudf                      0.12.0b200116         py37_1452    rapidsai-nightly
dask-cudf                 0.12.0b200116         py37_1452    rapidsai-nightly
libcudf                   0.12.0b200116     cuda10.1_1422    rapidsai-nightly
dask-cuda                 0.12.0a200117           py37_47    rapidsai-nightly
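
To double-check which dask-cuda build an environment actually imports, something like the following can be used (this assumes dask_cuda exposes __version__):

import dask_cuda

# The version string distinguishes the source install (e.g. a .dev0+... tag)
# from the conda nightly, and __file__ shows which checkout is being imported.
print(dask_cuda.__version__)
print(dask_cuda.__file__)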

@jakirkham, yup, the issue no longer seems to be present, as the workflow works now. Thanks for closing.

This patch should be in the latest nightlies. @VibhuJawa, would you be able to try them and let us know if they are working?

@jakirkham, sure, will update here once I get the time.


@pentschev, I tested #227 and it now works successfully. Thanks a lot for your work on this, and sorry for the delay in testing.

Tested on the cuDF versions below (for record keeping):

cudf                      0.12.0b200116         py37_1452    rapidsai-nightly
dask-cudf                 0.12.0b200116         py37_1452    rapidsai-nightly
libcudf                   0.12.0b200116     cuda10.1_1422    rapidsai-nightly