rmm: [BUG] unhandled exn for cpu buildbots
Edit: Apparently our patch of dask_sql to avoid CVEs unexpectedly upgraded our RAPIDS libs to 2022.04, even though the base image was 2022.02; will update the base to 2022.04 as well.
Describe the bug
cudf imports now fail on CPU buildbots because rmm now throws RuntimeError instead of the cudf-expected CUDARuntimeError during getDeviceCount().
This is a problem because scenarios like buildbots are often CPU-only, so this unexpected change breaking import cudf also breaks downstream dependencies like import dask_sql (which should work on CPU).
cudf hasn’t changed here in ~3 years, but it calls rmm, which has changed in https://github.com/rapidsai/rmm/commit/d94bdfd060c8c54379d01c21b8386492f36c9fd1
In particular, instead of returning a status that gets translated into a managed exception, cudaGetDeviceCount() now raises a RuntimeError:
def getDeviceCount():
    """
    Returns the number of devices with compute capability greater or
    equal to 2.0 that are available for execution.

    This function automatically raises CUDARuntimeError with error message
    and status code.
    """
    status, count = cudart.cudaGetDeviceCount()
    if status != cudart.cudaError_t.cudaSuccess:
        raise CUDARuntimeError(status)
    return count
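For context, here is a minimal sketch of why the changed exception type escapes. It is an illustration of the catch mismatch, not cudf's actual validate_setup(); it assumes CUDARuntimeError is exposed by rmm._cuda.gpu alongside getDeviceCount(), as in the snippet above.

    # Illustration only: a caller written against the old contract.
    from rmm._cuda.gpu import CUDARuntimeError, getDeviceCount

    def count_gpus_or_zero():
        try:
            # On a CPU-only box, CUDA Python now raises a plain RuntimeError
            # inside cudaGetDeviceCount(), before any status is returned.
            return getDeviceCount()
        except CUDARuntimeError:
            # The "no usable GPU" path that import-time checks rely on.
            return 0

    # A plain RuntimeError is not a CUDARuntimeError, so it escapes the handler
    # above and aborts `import cudf` on CPU machines.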
Steps/Code to reproduce bug
On a CPU box, try something like docker run --rm -it graphistry/graphistry-forge-etl-python:latest python -c "import cudf"
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_sql/input_utils/dask.py:8: in <module>
import dask_cudf
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cudf/__init__.py:5: in <module>
import cudf
/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/__init__.py:5: in <module>
validate_setup()
/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/utils/gpu_utils.py:52: in validate_setup
gpus_count = getDeviceCount()
/opt/conda/envs/rapids/lib/python3.8/site-packages/rmm/_cuda/gpu.py:99: in getDeviceCount
status, count = cudart.cudaGetDeviceCount()
cuda/cudart.pyx:8141: in cuda.cudart.cudaGetDeviceCount
???
cuda/ccudart.pyx:486: in cuda.ccudart.cudaGetDeviceCount
???
cuda/_lib/ccudart/ccudart.pyx:1463: in cuda._lib.ccudart.ccudart._cudaGetDeviceCount
???
cuda/_cuda/ccuda.pyx:3583: in cuda._cuda.ccuda._cuDeviceGetCount
???
E RuntimeError: Function "cuDeviceGetCount" not found
Expected behavior
getDeviceCount() should raise CUDARuntimeError instead of RuntimeError, so that cudf's import-time GPU check can handle the no-GPU case as before.
Environment details (please complete the following information):
GitHub CPU Ubuntu runner with a Graphistry GPU container whose base is the RAPIDS 2022.02.1 runtime image
Additional context
https://rapids-goai.slack.com/archives/C5E06F4DC/p1649798365680969
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (7 by maintainers)
Thanks! We'll probably be swapping in closer to EOW as we're getting our 22.04 enterprise release out first 😃

Agreed wrt diagnosis. Tricky!
@shwina nightlies fix is ok for us, our impact is just on our CI CPU bots
To clarify, I believe the issue is:
- import dask_sql
- no GPU driver (libcuda.so) installed
- cudf installed

In particular, dask_sql does something along the lines of the guarded import sketched after this comment. Previously, import cudf on a machine without libcuda.so would raise an ImportError, while today it raises a RuntimeError.

In short, this is due to the switch to CUDA Python, which throws a RuntimeError when it fails to dlopen libcuda.so. Prior to the use of CUDA Python, we got an ImportError instead, because RMM extension modules would link to libcuda.so (missing shared lib dependencies in extension modules manifest as ImportError).

https://github.com/rapidsai/cudf/pull/10653 should address this issue by handling the RuntimeError. This allows an ImportError to eventually be thrown when loading cuDF's extension modules, which dask_sql will correctly catch.

Edit: note that none of this is ideal. That import cudf will raise an ImportError on CPU-only machines is still a bit of an implementation detail and not a promise by any means. The ideal scenario is being able to import cudf successfully on CPU machines.

@lmeyerov - would it be sufficient for you if this issue was fixed in the nightlies (and not a patch release)?
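A rough sketch of that guarded import (the flag name is made up for illustration; this is not dask_sql's verbatim code):

    try:
        import dask_cudf  # pulls in cudf, which probes the GPU at import time
        GPU_SUPPORT = True
    except ImportError:
        # Expected path on CPU-only machines before the CUDA Python switch.
        dask_cudf = None
        GPU_SUPPORT = False

    # The probe now raises RuntimeError instead, which this except clause does
    # not catch, so `import dask_sql` itself fails on CPU-only machines.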
@shwina's rec to update import cudf would work for our immediate case.

In case other libs also have the same issue (am not sure), a tighter fix would be for rmm to preserve its previous init exception behavior: put a try/except around cudart.cudaGetDeviceCount() and throw CUDARuntimeError there too, vs the new RuntimeError.
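A sketch of what that rmm-side change could look like, under the same assumptions as the snippet in the bug description (CUDARuntimeError defined in the same module; the status code used for the re-raise is a placeholder, not a vetted choice):

    from cuda import cudart

    def getDeviceCount():
        try:
            status, count = cudart.cudaGetDeviceCount()
        except RuntimeError:
            # CUDA Python could not even reach the driver (e.g. libcuda.so is
            # missing); re-raise with the exception type callers already handle.
            raise CUDARuntimeError(cudart.cudaError_t.cudaErrorInitializationError)
        if status != cudart.cudaError_t.cudaSuccess:
            raise CUDARuntimeError(status)
        return count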