ray: [RLlib] `TorchPolicy` and `TFPolicy` cannot find any GPUs

What is the problem?

ray.get_gpu_ids() returns an empty list on my machine when I use TorchPolicy with config['num_gpus'] set. This leads to an IndexError at self.devices[0] when TorchPolicy runs on GPUs:

https://github.com/ray-project/ray/blob/1f35470560c90de69e7555097a4d2dd85065d6f8/rllib/policy/torch_policy.py#L154-L159
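
For context, that block builds the policy's device list from ray.get_gpu_ids(); when the call returns an empty list, indexing the first device fails. A rough, runnable paraphrase (not the exact Ray 1.5.0 source):

import ray
import torch

ray.init()  # even on a GPU machine, the driver process gets no assigned GPUs

# Paraphrase of the device setup in the linked TorchPolicy block:
gpu_ids = ray.get_gpu_ids()   # [] in the driver and in workers without assigned GPUs
devices = [torch.device("cuda:{}".format(i)) for i, _ in enumerate(gpu_ids)]
device = devices[0]           # IndexError: list index out of range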

This issue can be reproduced on multiple machines. Ray version and other system information (Python version, TensorFlow version, OS):

My Runtime Environment

Machine 1:
  • OS version: Ubuntu 20.04 LTS
  • Python version: 3.8.10
  • Ray version: 1.5.0 from PyPI (tested with nightly build as well)
  • PyTorch version: 1.9.0
  • NVIDIA driver version: 470.57.02
  • CUDA version: 11.1.1
Machine 2:
  • OS version: Ubuntu 16.04 LTS
  • Python version: 3.7.10
  • Ray version: 1.5.0 from PyPI (tested with nightly build as well)
  • PyTorch version: 1.4.0
  • NVIDIA driver version: 430.64
  • CUDA version: 10.0.0

Same issue on Windows: https://discuss.ray.io/t/error-with-torch-policy-and-ray-get-gpu-ids-on-windows/2711

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

conda create --name test python=3.8 --yes
conda activate test
pip3 install https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
python3 -c 'import ray; print(ray.get_gpu_ids())'
nvidia-smi --list-gpus
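
The same failure can also be hit through RLlib directly (sketch only, based on the report above; Ray 1.5.x API, needs a GPU machine with RLlib and PyTorch installed):

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
# With framework="torch" and num_gpus > 0, TorchPolicy builds its device list
# from ray.get_gpu_ids(), which is empty in this process, triggering the
# IndexError described above.
trainer = PPOTrainer(env="CartPole-v0",
                     config={"framework": "torch", "num_gpus": 1})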

If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 34 (20 by maintainers)

Most upvoted comments

Just in case anyone stumbles upon this issue and is running Ray locally: adding local_mode=True to the ray.init() call could fix the issue. Just make sure TensorFlow or PyTorch is able to detect the GPU.
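
A minimal sketch of that workaround (assumes a single local GPU and a working PyTorch install):

import ray
import torch

# In local mode, tasks run inside the driver process, so the GPU stays visible to it.
ray.init(local_mode=True)

print(ray.get_gpu_ids())          # e.g. ['0'] (or [0] on older Ray versions)
print(torch.cuda.is_available())  # sanity check that the framework can see the GPU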

Isn’t this working as intended? GPU IDs are for assigned GPUs only, and are only valid to call in a remote function.

It will always be the empty list in the driver.
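
In other words, the intended usage pattern is roughly the following (assuming a machine with at least one GPU; the task name is just an example):

import ray

ray.init()

@ray.remote(num_gpus=1)
def assigned_gpus():
    # Inside a task that requested GPUs, Ray sets CUDA_VISIBLE_DEVICES and
    # get_gpu_ids() returns the IDs assigned to this worker.
    return ray.get_gpu_ids()

print(ray.get_gpu_ids())                # [] -- no GPUs are assigned to the driver
print(ray.get(assigned_gpus.remote()))  # e.g. [0] (or ['0'] on newer Ray versions)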

On Python 3.11.5 with Ray 2.7.1: with ray.init(local_mode=True), get_gpu_ids() returned ['0'], whereas with plain ray.init() it returned [].

I’ll provide a fix in TorchPolicy.

Same issue for TFPolicy as well. I tried to fix it using torch.cuda.device_count() and tf.config.list_physical_devices('GPU') in #17398.
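
A rough sketch of that fallback idea (not the actual patch in #17398): instead of relying on ray.get_gpu_ids(), ask the framework how many devices it can actually see, e.g. on the Torch side:

import torch

def pick_torch_devices(num_gpus):
    # Hypothetical helper: choose devices for a policy configured with
    # num_gpus, falling back to CPU instead of raising IndexError.
    visible = torch.cuda.device_count()
    if num_gpus > 0 and visible > 0:
        return [torch.device("cuda:{}".format(i))
                for i in range(min(num_gpus, visible))]
    return [torch.device("cpu")]

(For TFPolicy, tf.config.list_physical_devices('GPU') plays the same role.)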

Confirmed using Ray cluster launcher:

(base) ray@ip-172-31-16-127:~$ ray status
======== Autoscaler status: 2021-07-28 06:44:21.751891 ========
Node status
---------------------------------------------------------------
Healthy:
 1 cpu_4_ondemand
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/4.0 CPU
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:V100
 0.00/35.597 GiB memory
 0.00/17.799 GiB object_store_memory

Demands:
 (no resource demands)
(base) ray@ip-172-31-16-127:~$ nvidia-smi
Wed Jul 28 06:44:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   46C    P0    27W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
(base) ray@ip-172-31-16-127:~$ ray status
======== Autoscaler status: 2021-07-28 06:44:48.189085 ========
Node status
---------------------------------------------------------------
Healthy:
 1 cpu_4_ondemand
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/4.0 CPU
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:V100
 0.00/35.597 GiB memory
 0.00/17.799 GiB object_store_memory

Demands:
 (no resource demands)
(base) ray@ip-172-31-16-127:~$ python3 -c 'import ray; ray.init(address="auto"); print(ray.get_gpu_ids())'
2021-07-28 06:44:55,502	INFO worker.py:736 -- Connecting to existing Ray cluster at address: 172.31.16.127:6379
[]

But from a Python session connected to the same cluster (via ray.init(address="auto")):

{'node_ip_address': '172.31.16.127', 'raylet_ip_address': '172.31.16.127', 'redis_address': '172.31.16.127:6379', 'object_store_address': '/tmp/ray/session_2021-07-28_06-43-06_018883_312/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-07-28_06-43-06_018883_312/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2021-07-28_06-43-06_018883_312', 'metrics_export_port': 63257, 'node_id': '819b046909b777f273f163c1024c8fe92f7f6aeb005ac15a418e37a0'}
>>> ray.cluster_resources()
{'accelerator_type:V100': 1.0, 'memory': 38222325351.0, 'CPU': 4.0, 'GPU': 1.0, 'node:172.31.16.127': 1.0, 'object_store_memory': 19111162675.0}
>>> @ray.remote(num_gpus=1)
... def test_gpu():
...     import torch
...     print(torch.cuda.is_available())
... 
>>> ray.get(test_gpu.remote())
>>> (pid=603) True

So it might be a problem with ray.get_gpu_ids()?

cc @ijrsvt