ray: Ray cannot access GPUs under a non-root user (ray.init() fails to read root-owned `/proc/driver/nvidia/gpus`)
What happened + What you expected to happen
I wanted to run ray.init() in a Jupyter Notebook under security-hardened OpenShift 3.11 (on RHEL 7.x) on a node with GPUs. The Python script had previously been tested and worked fine on a dev server under plain Docker (on CentOS Stream 8) on a machine without any GPUs (with default Docker capabilities but a custom UID).
Error message:
```
2022-08-23 10:20:32,243 ERROR resource_spec.py:193 -- Could not parse gpu information.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/resource_spec.py", line 189, in resolve
    info_string = _get_gpu_info_string()
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/resource_spec.py", line 351, in _get_gpu_info_string
    gpu_dirs = os.listdir(proc_gpus_path)
PermissionError: [Errno 13] Permission denied: '/proc/driver/nvidia/gpus'
2022-08-23 10:20:32,246 WARNING services.py:1882 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2022-08-23 10:20:32,402 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://<redacted>:8265
```
Expected:
Remove any code lines that read from root-owned folders such as /proc/driver/nvidia/gpus, at least in:
- `_get_gpu_info_string()`
- `_autodetect_num_gpus()`
I’m almost certain you can find the info you need using the NVIDIA utility nvidia-smi (see the available info using its -h switch); a sketch of this approach follows below.
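As an illustration only (the helper name `query_gpu_info` is mine, not Ray's actual API), the same per-GPU information could be obtained by asking nvidia-smi directly instead of listing the root-owned /proc/driver path:

```python
# Illustrative sketch only (not Ray's actual code): query per-GPU model and
# memory via nvidia-smi instead of listing the root-owned /proc/driver path.
import subprocess


def query_gpu_info():
    """Return a list of (name, total_memory) tuples, or [] if nvidia-smi is unavailable."""
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            stderr=subprocess.DEVNULL,
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    gpus = []
    for line in out.strip().splitlines():
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus.append((name, memory))
    return gpus


print(query_gpu_info())  # e.g. [('Tesla V100-SXM2-16GB', '16160 MiB')]
```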
Versions / Dependencies
```
$ pip freeze | grep ray
lightgbm-ray==0.1.5
ray==2.0.0
xgboost-ray==0.1.10
```
Reproduction script
Run ray.init() as a non-root user on a Linux machine with a GPU and its appropriate driver installed. For example, using this code snippet:
```python
import ray

ray_cluster_num_cpus = 32
num_gpus = 1
ray_dashboard_host = "0.0.0.0"  # external
ray_dashboard_port = 8265

ray.shutdown()
ray.init(num_cpus=ray_cluster_num_cpus,
         num_gpus=num_gpus,
         include_dashboard=True,
         dashboard_host=ray_dashboard_host,
         dashboard_port=ray_dashboard_port)
```
This was reproduced in our GPU-enabled Jupyter Notebook container (mirekphd/ml-gpu-py38-cuda112-cust:latest) under OpenShift 3.11 (which runs containers under non-root users with random UIDs):

```
# we are running as a user with a high UID:
$ id
uid=1000150000(jovyan) gid=100(users) groups=100(users),1000150000

# ...but the folder Ray tries to access (/proc/driver/nvidia/gpus) is root-owned:
$ ls -lant /proc/driver/
total 0
dr-xr-xr-x.    2 0 0   0 Aug 23 10:22 nvidia-caps
dr-xr-xr-x.    3 0 0   0 Aug 23 10:22 nvidia-nvlink
dr-xr-xr-x.    3 0 0   0 Aug 23 10:22 nvidia-nvswitch
dr-xr-xr-x.    4 0 0   0 Aug 23 10:22 nvidia-uvm
-r--r--r--.    1 0 0   0 Aug 23 10:22 nvram
-r--r--r--.    1 0 0   0 Aug 23 10:22 rtc
dr-xr-xr-x.    3 0 0 120 Aug 23 09:26 nvidia
dr-xr-xr-x.    7 0 0   0 Aug 23 09:26 .
dr-xr-xr-x. 1684 0 0   0 Aug 23 09:26 ..
```
Issue Severity
High: It blocks me from completing my task.
About this issue
- State: closed
- Created 2 years ago
- Comments: 29 (23 by maintainers)
Commits related to this issue
- Further Ray improvements * Improves error handling when Ray head is not running * Starts up Ray with the --num-gpus flag to avoid https://github.com/ray-project/ray/issues/28064 in certain CML configu... — committed to ma1112/cmlextensions by deleted user a year ago
- Adds the --num-gpus flag to ray commands to avoid ray-project/ray#28064 in certain CML configurations. — committed to ma1112/cmlextensions by deleted user a year ago
- Adds the --num-gpus flag to ray commands to avoid ray-project/ray#28064 in certain CML configurations. (#10) Co-authored-by: Arpad Marinovszki <amarinovszki@cloudera.com> — committed to cloudera/cmlextensions by ma1112 a year ago
This would resolve the issue:
See also: https://github.com/ray-project/ray/issues/17914#issuecomment-1236095136
Fortunately, this task of parsing `nvidia-smi` output has already been accomplished in the `GPUtil` package (see its GitHub page), which `ray` already uses for this very purpose, but it fails silently if the package is not installed, without any warnings or installation recommendations (which probably should be included in requirements.txt and definitely described in the docs): [ `_autodetect_num_gpus()` ]
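For example, a minimal sketch assuming the optional `GPUtil` package is installed (`pip install gputil`); it obtains the GPU count by shelling out to nvidia-smi rather than reading /proc:

```python
# Minimal sketch, assuming the optional GPUtil package is installed
# (pip install gputil); it shells out to nvidia-smi under the hood,
# so no access to root-owned /proc paths is needed.
import GPUtil

gpus = GPUtil.getGPUs()
print(len(gpus))                   # number of GPUs visible to nvidia-smi
print([gpu.name for gpu in gpus])  # e.g. ['Tesla V100-SXM2-16GB']
```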
More info
Here’s more color on how to safely and reliably detect the number of available GPUs, which can be used to improve the existing solution in `_autodetect_num_gpus()`:
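A sketch of what such an improvement could look like (the fallback order and the helper name `autodetect_num_gpus` are my suggestion, not Ray's current code): honor `CUDA_VISIBLE_DEVICES` if set, otherwise count the devices listed by `nvidia-smi -L`, and fall back to zero when no driver or nvidia-smi is present:

```python
# Suggested sketch only (not Ray's current implementation): detect the number
# of usable GPUs without reading root-owned /proc paths.
import os
import subprocess


def autodetect_num_gpus():
    # 1) Respect an explicit CUDA_VISIBLE_DEVICES restriction if it is set.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is not None:
        visible = visible.strip()
        return 0 if visible in ("", "-1") else len(visible.split(","))
    # 2) Otherwise count the devices reported by `nvidia-smi -L` (one per line).
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "-L"], stderr=subprocess.DEVNULL, text=True
        )
        return sum(1 for line in out.splitlines() if line.strip())
    except (OSError, subprocess.CalledProcessError):
        # 3) No driver / no nvidia-smi available: assume zero GPUs instead of crashing.
        return 0


print(autodetect_num_gpus())
```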