ray: Ray cannot detect GPUs under a non-root user (ray.init() fails to read root-owned `/proc/driver/nvidia/gpus`)

What happened + What you expected to happen

I wanted to run ray.init() in a Jupyter Notebook under security-hardened OpenShift 3.11 (on RHEL 7.x) on a node with GPUs. The Python script had previously been tested and worked fine on a dev server under plain Docker (on CentOS Stream 8), on a machine without any GPUs (with default Docker capabilities but a custom UID).

Error message:

2022-08-23 10:20:32,243	ERROR resource_spec.py:193 -- Could not parse gpu information.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/resource_spec.py", line 189, in resolve
    info_string = _get_gpu_info_string()
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/resource_spec.py", line 351, in _get_gpu_info_string
    gpu_dirs = os.listdir(proc_gpus_path)
PermissionError: [Errno 13] Permission denied: '/proc/driver/nvidia/gpus'
2022-08-23 10:20:32,246	WARNING services.py:1882 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2022-08-23 10:20:32,402	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://<redacted>:8265 

Expected: remove any code that reads from root-owned folders such as /proc/driver/nvidia/gpus, at least in:

  • _get_gpu_info_string
  • _autodetect_num_gpus

I’m almost certain the information you need can be obtained with the NVIDIA utility nvidia-smi (see the available queries via its -h switch), for example as sketched below.
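
A minimal sketch of such a check, assuming nvidia-smi is on the PATH (the helper name is mine, not Ray's):

import shutil
import subprocess


def count_gpus_via_nvidia_smi() -> int:
    """Count GPUs by asking nvidia-smi instead of reading /proc (illustrative helper)."""
    if shutil.which("nvidia-smi") is None:
        return 0  # driver utilities are not installed on this machine
    try:
        # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: Tesla T4 (UUID: GPU-...)"
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        ).stdout
    except subprocess.CalledProcessError:
        return 0
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))

This requires no special permissions, so it works for arbitrary non-root UIDs such as the random ones assigned by OpenShift.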

Versions / Dependencies

$ pip freeze | grep ray
lightgbm-ray==0.1.5
ray==2.0.0
xgboost-ray==0.1.10

Reproduction script

Run ray.init() as a non-root user on a Linux machine with a GPU and the appropriate driver installed.

For example, using this code snippet:

import ray

ray_cluster_num_cpus = 32
num_gpus = 1

ray_dashboard_host="0.0.0.0" # external
ray_dashboard_port=8265

ray.shutdown()

ray.init(num_cpus=ray_cluster_num_cpus,
         num_gpus=num_gpus,
         include_dashboard=True,
         dashboard_host=ray_dashboard_host,
         dashboard_port=ray_dashboard_port)

This was reproduced in our GPU-enabled Jupyter Notebook container (mirekphd/ml-gpu-py38-cuda112-cust:latest) under Openshift 3.11 (which runs containers under non-root users with random UIDs):

# we are running as a user with a high UID:
$ id
uid=1000150000(jovyan) gid=100(users) groups=100(users),1000150000

#... but the folder Ray tries to access (/proc/driver/nvidia/gpus) is root-owned
$ ls -lant /proc/driver/
total 0
dr-xr-xr-x.    2 0 0   0 Aug 23 10:22 nvidia-caps
dr-xr-xr-x.    3 0 0   0 Aug 23 10:22 nvidia-nvlink
dr-xr-xr-x.    3 0 0   0 Aug 23 10:22 nvidia-nvswitch
dr-xr-xr-x.    4 0 0   0 Aug 23 10:22 nvidia-uvm
-r--r--r--.    1 0 0   0 Aug 23 10:22 nvram
-r--r--r--.    1 0 0   0 Aug 23 10:22 rtc
dr-xr-xr-x.    3 0 0 120 Aug 23 09:26 nvidia
dr-xr-xr-x.    7 0 0   0 Aug 23 09:26 .
dr-xr-xr-x. 1684 0 0   0 Aug 23 09:26 ..

Issue Severity

High: It blocks me from completing my task.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 29 (23 by maintainers)

Most upvoted comments

I think we’re concerned about putting a small dependency like GPUtil into ray’s core dependencies, but vendoring the library seems like a reasonable approach.
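
For illustration, a vendored import could be wrapped roughly like this (the ray._private.thirdparty.gputil path is hypothetical, not an existing Ray module):

# Hypothetical shim: prefer a vendored copy, fall back to an installed GPUtil,
# and make the absence explicit so callers can warn instead of failing silently.
try:
    from ray._private.thirdparty import gputil as GPUtil  # vendored copy (hypothetical path)
except ImportError:
    try:
        import GPUtil  # user-installed package
    except ImportError:
        GPUtil = None  # callers must fall back to nvidia-smi or /proc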

This would resolve the issue:

See also: https://github.com/ray-project/ray/issues/17914#issuecomment-1236095136

Fortunately, the task of parsing nvidia-smi output has already been accomplished in the GPUtil package (see its GitHub page), which Ray already uses for this very purpose. However, if the package is not installed, Ray falls back silently, without any warning or installation recommendation (GPUtil should probably be added to requirements.txt and definitely described in the docs):

    if importlib.util.find_spec("GPUtil"):
        gpu_list = GPUtil.getGPUs()
        result = len(gpu_list)
    # SILENTLY FAILS ON MISSING GPUtil...
    # TODO: USE `else` AND PRINT A WARNING RECOMMENDING GPUtil INSTALLATION
    elif sys.platform.startswith("linux"):
        # TRIES TO ACCESS A ROOT-OWNED FOLDER HERE, THUS PRODUCING THE ERROR DESCRIBED IN #28064
        [..]
    elif sys.platform == "win32":
        [..]
    return result

[ _autodetect_num_gpus() ]
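
Putting these pieces together, here is a sketch of what a non-root-friendly detection order could look like (my proposal, not Ray's actual implementation): try GPUtil first, then nvidia-smi, and only then /proc, tolerating the PermissionError instead of logging a traceback.

import importlib.util
import logging
import os
import shutil
import subprocess
import sys

logger = logging.getLogger(__name__)


def autodetect_num_gpus_nonroot() -> int:
    """Sketch of a detection order that never requires root: GPUtil -> nvidia-smi -> /proc."""
    # Preferred path: GPUtil, if installed (what Ray already tries first).
    if importlib.util.find_spec("GPUtil"):
        import GPUtil

        return len(GPUtil.getGPUs())

    logger.warning(
        "GPUtil is not installed; falling back to nvidia-smi. "
        "Installing GPUtil (`pip install gputil`) is recommended."
    )

    # Second choice: ask nvidia-smi, which works for unprivileged users.
    if shutil.which("nvidia-smi") is not None:
        try:
            out = subprocess.run(
                ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
            ).stdout
            return sum(1 for line in out.splitlines() if line.startswith("GPU "))
        except subprocess.CalledProcessError:
            pass

    # Last resort: /proc, but tolerate the PermissionError seen in this issue
    # instead of printing a traceback.
    if sys.platform.startswith("linux"):
        try:
            return len(os.listdir("/proc/driver/nvidia/gpus"))
        except (FileNotFoundError, NotADirectoryError, PermissionError):
            return 0
    return 0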

More info

Here’s more color on how to safely and reliably detect the number of available GPUs, which can be used to improve the existing solution in _autodetect_num_gpus():

>>> import GPUtil

>>> GPUtil.getGPUs()
[<GPUtil.GPUtil.GPU object at 0x7f8ab7907f10>]

>>> gpus = GPUtil.getGPUs()

# a very liberal check, which will always succeed in a physical presence of a GPU, including cards fully utilized by noisy neighbors (with no free VRAM and 100% load)
>>> gpu_avail = GPUtil.getAvailability(gpus, maxLoad=1.0, maxMemory=1.0, includeNan=False, excludeID=[], excludeUUID=[])
>>> gpu_avail
[1]

# versus excessively conservative check, which will likely never succeed (there is always some minimal VRAM usage even in headless servers)
>>> gpu_avail = GPUtil.getAvailability(gpus, maxLoad=0.0, maxMemory=0.0, includeNan=False, excludeID=[], excludeUUID=[])
>>> gpu_avail
[0]

The only remaining work is proper parsing, since nvidia-smi prints its output in a human-readable format.
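
As a side note, nvidia-smi can also emit machine-readable CSV directly via --query-gpu and --format=csv,noheader,nounits, so parsing the human-readable layout is not strictly necessary. A rough sketch of reading per-GPU utilization that way (the helper name is mine):

import subprocess


def gpu_utilization_via_nvidia_smi():
    """Return (index, gpu_util_percent, mem_used_mib, mem_total_mib) per GPU (illustrative helper)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        if not line.strip():
            continue
        index, util, mem_used, mem_total = (field.strip() for field in line.split(","))
        rows.append((int(index), float(util), float(mem_used), float(mem_total)))
    return rows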