ray: [autoscaler][rllib] RLlib cluster is not autoscaling up
TL;DR
The answer is simple: set the `RAY_ADDRESS=auto` environment variable and extend the idle timeout. To reproduce, use:
- a minimal ray aws cluster yaml: gpu.yaml
- rllib (algo: DDPPO) yaml: rllib_test.yaml
Then run the whole pipeline by executing the following commands in order:
```
ray up gpu.yaml
ray rsync_up gpu.yaml './rllib_test.yaml' '~/rllib_test.yaml'
ray exec gpu.yaml "RAY_ADDRESS=auto rllib train -f rllib_test.yaml"
```
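For context, a minimal `gpu.yaml` could look like the sketch below. All concrete values (cluster name, instance types, region, worker counts) are illustrative placeholders, not the issue author's actual file; the key line for this issue is `idle_timeout_minutes`:

```yaml
# Minimal AWS cluster config sketch (all field values are assumptions).
cluster_name: gpu
max_workers: 2

# Keep idle workers alive longer so RLlib trials can reuse them;
# raising this value was the fix discussed in this issue.
idle_timeout_minutes: 30

provider:
    type: aws
    region: us-west-2

available_node_types:
    head:
        node_config:
            InstanceType: m5.xlarge
        max_workers: 0
    gpu_worker:
        node_config:
            InstanceType: g4dn.xlarge
        min_workers: 0
        max_workers: 2

head_node_type: head
```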
Description
A quick question. Am I understanding it correctly that in order to run RLlib examples in a Ray Cluster, we simply need to run:
ray exec ray-ml.yaml "rllib -f example.py"
and the cluster will auto-scale rllib applications? Or will it just run on the head without workers spinning up?
Is there any example about submitting RLLib jobs to Ray Clusters?
I’m following this https://docs.ray.io/en/latest/cluster/commands.html#cluster-commands to set up Ray clusters and everything works smoothly. The team did a really good job providing examples. However, I searched the internet and could barely find useful information or examples for running Tune and RLlib with a Ray Cluster.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 20 (19 by maintainers)
can we mitigate that by just setting the idle timeout high? `idle_timeout_minutes: 9999`?

@wuisawesome yes we can, LoL. It works now!
Didn’t expect the answer to be so easy. Many thanks for your help!
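The fix is a single top-level key in the cluster yaml (the extreme value mirrors the comment; a smaller value is likely saner in practice):

```yaml
# Effectively disables scale-down of idle workers,
# so worker nodes stay up between RLlib workloads.
idle_timeout_minutes: 9999
```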