ray: [rllib] Custom model cannot use GPU for driver when running PPO algorithm

What is the problem?

When using the combination of a custom model, PPO, and a GPU for the driver, the following error appears:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation default_policy/lstm/bias/Initializer/concat: Could not satisfy explicit device specification '' because the node {{colocation_node default_policy/lstm/bias/Initializer/concat}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0].

Ray version and other system information (Python version, TensorFlow version, OS):

Ray 0.8.0 Python 3.6.6 tensorflow-gpu 2.0.0 Fedora 28

Does the problem occur on the latest wheels?

Yes, although it gives a different error, and an additional combination now fails as well: custom_keras_model.py with num_gpus set to 1, a combination that does not fail on Ray 0.8.0. The error on the latest wheel is the following:

File "project/venv/lib64/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 356, in __init__
    "GPUs were assigned to this worker by Ray, but "
RuntimeError: GPUs were assigned to this worker by Ray, but TensorFlow reports GPU acceleration is disabled. This could be due to a bad CUDA or TF installation.

Summary

Ray version  | script                    | num_gpus | works?
0.8.0        | custom_keras_model.py     | 0        | Yes
0.8.0        | custom_keras_model.py     | 1        | Yes
0.8.0        | custom_keras_rnn_model.py | 0        | Yes
0.8.0        | custom_keras_rnn_model.py | 1        | No
latest wheel | custom_keras_model.py     | 0        | Yes
latest wheel | custom_keras_model.py     | 1        | No
latest wheel | custom_keras_rnn_model.py | 0        | Yes
latest wheel | custom_keras_rnn_model.py | 1        | No

Reproduction

Please note that this only reproduces the last row in the table. In order to test custom_keras_model.py, you also need to modify the algorithm used at the top of the file.

python3 -m venv venv
. venv/bin/activate
pip3 install --upgrade pip setuptools wheel
pip3 install tensorflow-gpu==2.0.0
# Install [rllib] dependencies
pip3 install ray[rllib]==0.8.0
pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
sed -i '158i "num_gpus": 1,' venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
python3 venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
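For reference, the sed command above simply inserts a "num_gpus": 1 entry into the example's config dict. A minimal sketch of the resulting config is below; every key other than "num_gpus" is an illustrative assumption, not copied from the actual custom_keras_rnn_model.py script.

```python
# Sketch of the config edit performed by the sed command.
# Only "num_gpus" is taken from the repro; the other entries
# (env name, custom model name) are hypothetical placeholders.
config = {
    "env": "repeat_after_me",    # hypothetical env name
    "num_gpus": 1,               # the line sed inserts at line 158
    "model": {
        "custom_model": "rnn",   # hypothetical custom model name
    },
}

print(config["num_gpus"])
```

Editing the config dict directly in a copy of the script is equivalent to the sed invocation and avoids depending on a hard-coded line number.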

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Correct, the repro script above works with the combination of CUDA 10.0.130.1 and cuDNN v7.4.2. Thanks for the help!

I have the same problem with: Python 3.6.6, Ray 0.8.5, tensorflow-gpu 2.1.0, CUDA 10.0, cuDNN 7.6.5.

Interestingly, with the same config I was able to instantiate a Trainer and call trainer.train() directly; running the same config through ray.tune threw the above error.

Yes, I can reproduce your results with rllib==0.8.0 running the custom_keras_rnn_model.py script. It looks like the GPU is recognized as an XLA_GPU rather than a standard GPU. From searching around, it appears to be an incompatibility between TF 2.0 and the underlying cuDNN/CUDA drivers. I have CUDA 10.1 and cuDNN 7.6.2.24, which does not appear to be supported in this list (see the bottom of the page for TF 2.x GPU builds). You may have to downgrade to CUDA 10.0 and cuDNN 7.4.
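The version check described above can be sketched as a small helper. This is a rough illustration, not an official tool; the tested-build pairs below are assumptions based on the TensorFlow install docs (CUDA 10.0 + cuDNN 7.4 for TF 2.0, CUDA 10.1 + cuDNN 7.6 for TF 2.1), and the prefix match is deliberately crude.

```python
# Hedged sketch: compare an installed CUDA/cuDNN pair against the
# tested-build combinations assumed from the TF install docs.
TESTED_BUILDS = {
    "2.0.0": {"cuda": "10.0", "cudnn": "7.4"},
    "2.1.0": {"cuda": "10.1", "cudnn": "7.6"},
}

def is_supported(tf_version: str, cuda: str, cudnn: str) -> bool:
    """Return True if the CUDA/cuDNN pair matches the tested build
    for the given TF version (simple prefix comparison)."""
    build = TESTED_BUILDS.get(tf_version)
    if build is None:
        return False
    return (cuda.startswith(build["cuda"])
            and cudnn.startswith(build["cudnn"]))

# The reporter's failing combination with TF 2.0.0:
print(is_supported("2.0.0", "10.1", "7.6.2.24"))   # False
# The combination the original poster later confirmed working:
print(is_supported("2.0.0", "10.0.130", "7.4.2"))  # True
```

A prefix match like this can misfire on versions such as 10.10 vs 10.1; a real checker would parse the version components, but the sketch is enough to show why CUDA 10.1 with TF 2.0 falls outside the tested builds.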