ray: [autoscaler][rllib] RLlib cluster is not autoscaling up
TL;DR
The answer is simple: set the `RAY_ADDRESS=auto` environment variable and extend the idle timeout. To reproduce, use:
- a minimal ray aws cluster yaml: gpu.yaml
- rllib (algo: DDPPO) yaml: rllib_test.yaml
Then run the whole pipeline by executing the following commands in order:
```
ray up gpu.yaml
ray rsync_up gpu.yaml './rllib_test.yaml' '~/rllib_test.yaml'
ray exec gpu.yaml "RAY_ADDRESS=auto rllib train -f rllib_test.yaml"
```
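For context, a minimal `gpu.yaml` could look like the sketch below. All concrete values (cluster name, instance types, region, worker counts) are illustrative placeholders, not the issue author's actual file; the key line for this issue is `idle_timeout_minutes`:

```yaml
# Minimal AWS cluster config sketch (all field values are assumptions).
cluster_name: gpu
max_workers: 2

# Keep idle workers alive longer so RLlib trials can reuse them;
# raising this value was the fix discussed in this issue.
idle_timeout_minutes: 30

provider:
    type: aws
    region: us-west-2

available_node_types:
    head:
        node_config:
            InstanceType: m5.xlarge
        max_workers: 0
    gpu_worker:
        node_config:
            InstanceType: g4dn.xlarge
        min_workers: 0
        max_workers: 2

head_node_type: head
```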
Description
A quick question. Am I understanding it correctly that in order to run RLlib examples in a Ray Cluster, we simply need to run:
ray exec ray-ml.yaml "rllib -f example.py"
and the cluster will auto-scale rllib applications? Or will it just run on the head without workers spinning up?
Is there any example about submitting RLLib jobs to Ray Clusters?
I’m following this https://docs.ray.io/en/latest/cluster/commands.html#cluster-commands to set up Ray clusters and everything works smoothly. The team did a really good job providing examples. However, I searched the internet and could barely find useful information or examples for running Tune and RLlib with a Ray Cluster.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 20 (19 by maintainers)
can we mitigate that by just setting the idle timeout high? `idle_timeout_minutes: 9999`?

@wuisawesome yes we can, LoL. It works now!
Didn’t expect the answer to be so easy. Many thanks for your help!
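The fix is a single top-level key in the cluster yaml (the extreme value mirrors the comment; a smaller value is likely saner in practice):

```yaml
# Effectively disables scale-down of idle workers,
# so worker nodes stay up between RLlib workloads.
idle_timeout_minutes: 9999
```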