rmm: [BUG] unhandled exception on CPU buildbots

Edit: Apparently our patch of dask_sql to avoid CVEs unexpectedly upgraded our RAPIDS libs to 2022.04, even though the base image was 2022.02; we will update the base to 2022.04 as well


Describe the bug

cudf imports now fail on CPU buildbots because rmm raises RuntimeError instead of the CUDARuntimeError that cudf expects during getDeviceCount().

This is a problem because environments like buildbots are often CPU-only, so this unexpected change breaks import cudf and, with it, downstream dependencies like import dask_sql (which is expected to work on CPU).

cudf hasn’t changed here in ~3 years, but it calls rmm, which has changed in https://github.com/rapidsai/rmm/commit/d94bdfd060c8c54379d01c21b8386492f36c9fd1

In particular, instead of returning a status that gets converted into a managed exception, cudaGetDeviceCount() now raises RuntimeError directly:

# Excerpted from rmm/_cuda/gpu.py; CUDARuntimeError is defined in the same module
from cuda import cudart

def getDeviceCount():
    """
    Returns the number of devices with compute capability greater or
    equal to 2.0 that are available for execution.
    This function automatically raises CUDARuntimeError with error message
    and status code.
    """
    status, count = cudart.cudaGetDeviceCount()
    if status != cudart.cudaError_t.cudaSuccess:
        raise CUDARuntimeError(status)
    return count
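
On a CPU-only machine the failure now happens inside that first call, so the status check and the CUDARuntimeError path above are never reached. A minimal sketch of that failure mode, matching the traceback further below:

# Sketch of the new failure mode on a box without libcuda.so: the cuda-python
# binding call itself raises, bypassing rmm's status-based error handling.
from cuda import cudart

try:
    status, count = cudart.cudaGetDeviceCount()
except RuntimeError as e:
    print(e)  # e.g. Function "cuDeviceGetCount" not found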

Steps/Code to reproduce bug

On a CPU box, try something like docker run --rm -it graphistry/graphistry-forge-etl-python:latest python -c "import cudf":

/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_sql/input_utils/dask.py:8: in <module>
    import dask_cudf
/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_cudf/__init__.py:5: in <module>
    import cudf
/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/__init__.py:5: in <module>
    validate_setup()
/opt/conda/envs/rapids/lib/python3.8/site-packages/cudf/utils/gpu_utils.py:52: in validate_setup
    gpus_count = getDeviceCount()
/opt/conda/envs/rapids/lib/python3.8/site-packages/rmm/_cuda/gpu.py:99: in getDeviceCount
    status, count = cudart.cudaGetDeviceCount()
cuda/cudart.pyx:8141: in cuda.cudart.cudaGetDeviceCount
    ???
cuda/ccudart.pyx:486: in cuda.ccudart.cudaGetDeviceCount
    ???
cuda/_lib/ccudart/ccudart.pyx:1463: in cuda._lib.ccudart.ccudart._cudaGetDeviceCount
    ???
cuda/_cuda/ccuda.pyx:3583: in cuda._cuda.ccuda._cuDeviceGetCount
    ???
E   RuntimeError: Function "cuDeviceGetCount" not found

Expected behavior

getDeviceCount() should raise CUDARuntimeError instead of RuntimeError, so downstream consumers like cudf can handle it as before.

Environment details (please complete the following information):

GitHub CPU Ubuntu runner with a Graphistry GPU container whose base is the RAPIDS 2022.02.1 runtime

Additional context

https://rapids-goai.slack.com/archives/C5E06F4DC/p1649798365680969


Most upvoted comments

Thanks! We’ll probably be swapping in closer to EOW as we’re getting our 22.04 enterprise release out first 😃

Agreed wrt diagnosis. Tricky!

@shwina a nightlies fix is OK for us; our impact is just on our CI CPU bots

To clarify, I believe the issue is:

  • being able to import dask_sql
  • on a CPU machine that doesn’t have the CUDA drivers (libcuda.so) installed
  • and has cudf installed

In particular, dask_sql does something along the lines of:

try:
    import cudf
except ImportError:
    cudf = None

Previously, import cudf on a machine without libcuda.so would raise an ImportError, while today it raises a RuntimeError.

In short, this is due to the switch to CUDA Python, which throws a RuntimeError when it fails to dlopen libcuda.so.

Prior to the use of CUDA Python, we got an ImportError instead because RMM extension modules would link to libcuda.so (missing shared lib dependencies in extension modules manifest as ImportError).
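
As an interim workaround on the consuming side (a sketch only, not what dask_sql actually ships), the optional-import guard could tolerate both exception types:

# Hedged workaround sketch: treat either the old ImportError or the new
# RuntimeError as "cudf not usable here".
try:
    import cudf
except (ImportError, RuntimeError):
    cudf = None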


https://github.com/rapidsai/cudf/pull/10653 should address this issue by handling the RuntimeError. This allows an ImportError to eventually be thrown when loading cuDF’s extension modules, which dask_sql will correctly catch.
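
For illustration only (this is a guess at the shape of the fix, not the actual diff in that PR), the handling might look something like this around cuDF's device-count check, where gpu_count_or_zero is a hypothetical helper name:

# Illustrative sketch, not the actual cudf change: treat a RuntimeError from
# the cuda-python-backed rmm call (missing libcuda.so) as "no usable GPU".
from rmm._cuda.gpu import getDeviceCount

def gpu_count_or_zero():  # hypothetical helper, not a real cudf function
    try:
        return getDeviceCount()
    except RuntimeError:
        # libcuda.so could not be loaded; act as if no devices are present
        return 0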

Edit: note that none of this is ideal. That import cudf will raise an ImportError on CPU-only machines is still a bit of an implementation detail and not a promise by any means. The ideal scenario is being able to import cudf successfully on CPU machines.


@lmeyerov - would it be sufficient for you if this issue were fixed in the nightlies (and not a patch release)?

@shwina's recommendation to update import cudf would work for our immediate case

In case other libs also hit the same issue (I'm not sure), a tighter fix would be for rmm to preserve its previous init exception behavior: put a try/except around cudart.cudaGetDeviceCount() and raise CUDARuntimeError there too, rather than letting the new RuntimeError escape.
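
A minimal sketch of that idea, assuming CUDARuntimeError is defined alongside getDeviceCount in rmm/_cuda/gpu.py (as the excerpt above suggests); the specific error code passed through is only illustrative:

# Hypothetical rmm-side sketch, not an actual patch: map the cuda-python loader
# failure back to the CUDARuntimeError that callers like cudf already expect.
from cuda import cudart

def getDeviceCount():
    try:
        status, count = cudart.cudaGetDeviceCount()
    except RuntimeError:
        # libcuda.so missing or symbol not found (CPU-only box); re-raise as the
        # managed exception type; the chosen error code is only illustrative
        raise CUDARuntimeError(cudart.cudaError_t.cudaErrorInsufficientDriver)
    if status != cudart.cudaError_t.cudaSuccess:
        raise CUDARuntimeError(status)
    return count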