ray: [RLlib] `TorchPolicy` and `TFPolicy` cannot find any GPUs
What is the problem?
`ray.get_gpu_ids()` returns an empty list on my machine when I’m using `TorchPolicy` with `config['num_gpus']` set. This raises an `IndexError` at `self.devices[0]` when using `TorchPolicy` on GPUs.
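The failure mode can be sketched in a few lines. This is a simplified illustration, not RLlib's actual code: `build_devices` and `first_device` are hypothetical helpers, and the sketch assumes the policy builds its device list from whatever `ray.get_gpu_ids()` returns and then indexes the first entry.

```python
from typing import List, Sequence


def build_devices(gpu_ids: Sequence) -> List[str]:
    """Hypothetical helper: map assigned GPU IDs to device names."""
    return [f"cuda:{i}" for i in gpu_ids]


def first_device(gpu_ids: Sequence) -> str:
    """Pick the first device, mirroring the `self.devices[0]` access in the report."""
    devices = build_devices(gpu_ids)
    # With an empty ID list, `devices` is empty and this line raises IndexError.
    return devices[0]
```

Under that assumption, `first_device([])` raises `IndexError`, which is the error type reported above, while `first_device([0])` returns a device name.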
This issue can be reproduced on multiple machines. Ray version and other system information (Python version, TensorFlow version, OS):
My Runtime Environment
Machine 1:
- OS version: Ubuntu 20.04 LTS
- Python version: 3.8.10
- Ray version: 1.5.0 from PyPI (tested with nightly build as well)
- PyTorch version: 1.9.0
- NVIDIA driver version: 470.57.02
- CUDA version: 11.1.1
Machine 2:
- OS version: Ubuntu 16.04 LTS
- Python version: 3.7.10
- Ray version: 1.5.0 from PyPI (tested with nightly build as well)
- PyTorch version: 1.4.0
- NVIDIA driver version: 430.64
- CUDA version: 10.0.0
Same issue on Windows: https://discuss.ray.io/t/error-with-torch-policy-and-ray-get-gpu-ids-on-windows/2711
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
conda create --name test python=3.8 --yes
conda activate test
pip3 install https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
python3 -c 'import ray; print(ray.get_gpu_ids())'
nvidia-smi --list-gpus
If the code snippet cannot be run by itself, the issue will be closed with “needs-repro-script”.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 34 (20 by maintainers)
Just in case anyone stumbles upon this issue and is running Ray locally: adding `local_mode=True` to the `ray.init()` call could fix the issue. Just make sure TensorFlow or PyTorch is able to detect the GPU.
Isn’t this working as intended? GPU IDs are for assigned GPUs only, and `ray.get_gpu_ids()` is only valid to call in a remote function. It will always be the empty list in the driver.
On Python 3.11.5 with Ray 2.7.1: with `ray.init(local_mode=True)`, `get_gpu_ids()` returned `['0']`, whereas with `ray.init()`, `get_gpu_ids()` returned `[]`.
Same issue for `TFPolicy` as well. I tried to fix it using `torch.cuda.device_count()` and `tf.config.list_physical_devices('GPU')` in #17398.
Confirmed using Ray cluster launcher:
But:
So it might be a problem with `ray.get_gpu_ids()`? cc @ijrsvt
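The fallback idea mentioned above (counting the GPUs the framework can see when Ray reports none) might look roughly like this. This is a hypothetical sketch and not the contents of #17398: `select_device_names` and its parameters are invented for illustration, and the framework-specific GPU count is passed in as a plain integer so the logic stays independent of PyTorch or TensorFlow.

```python
import math
from typing import List, Sequence


def select_device_names(
    assigned_gpu_ids: Sequence, visible_gpu_count: int, num_gpus: float
) -> List[str]:
    """Hypothetical helper: choose device names for a policy."""
    if num_gpus <= 0:
        return ["cpu"]
    if assigned_gpu_ids:
        # Prefer the IDs Ray assigned to this worker, if any.
        candidates = list(assigned_gpu_ids)
    else:
        # Assumption: fall back to every GPU the framework can see.
        candidates = list(range(visible_gpu_count))
    if not candidates:
        # No GPU is available at all, so use the CPU instead of failing later.
        return ["cpu"]
    # Fractional requests (e.g. 0.5) still need one whole device.
    return [f"cuda:{i}" for i in candidates[: math.ceil(num_gpus)]]
```

In practice `assigned_gpu_ids` would come from `ray.get_gpu_ids()` and `visible_gpu_count` from `torch.cuda.device_count()` or `len(tf.config.list_physical_devices('GPU'))`.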