ray: [Bug] Ray Autoscaler is not spinning down idle nodes due to secondary object copies
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Tune
What happened + What you expected to happen
Ray Autoscaler is not spinning down idle nodes if they ever ran a trial for the active Ray Tune job
The issue is seen with Ray Tune on a Ray 1.9.1 cluster with a CPU head node and GPU workers (min=0, max=9).
The Higgs Ray Tune job is set up to run up to 10 trials using async hyperband for 1 hour with
max_concurrency of 3. I see at most 3 trials running at a time (each requiring 1 GPU and 4 CPUs).
Except in the first log output at job startup, no PENDING trials are reported.
When the Ray Tune job is stopped at its 1-hour time limit at 12:43:48, the console log (see below) shows:
*) 3 nodes running Higgs trials (10.0.4.71, 10.0.6.5, 10.0.4.38)
*) 2 nodes that previously ran Higgs trials but are not doing so now (10.0.2.129, 10.0.3.245).
The latter 2 nodes last reported running trials at 12:21:30, so they should have been spun down.
Note that, in this run, multiple Ray Tune jobs were running in the same Ray cluster with some overlap:
MushroomEdibility Ray Tune 1hr job ran from 11:20-12:20
ForestCover Ray Tune 1hr job ran from 11:22-12:22
Higgs Ray Tune 1hr job ran from 11:45-12:45
After 12:22 there was no overlap of jobs, so the 2 idle workers that remained up had no use other than their historical Higgs trials.
Two other nodes that became idle when MushroomEdibility and ForestCover completed were spun down at that point, leaving the 2 idle nodes that Higgs had used still running.
In the same kind of scenario later in the run, I observed that after the Higgs job completed, all Higgs trial workers were spun down.
Current time: 2022-01-24 12:43:48 (running for 00:59:30.27) ...
Number of trials: 9/10 (3 RUNNING, 6 TERMINATED)
+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+
| Trial name | status | loc | combiner.bn_momentum | combiner.bn_virtual_bs | combiner.num_steps | combiner.output_size | combiner.relaxation_factor | combiner.size | combiner.sparsity | training.batch_size | training.decay_rate | training.decay_steps | training.learning_rate | iter | total time (s) | metric_score |
|----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------|
| trial_04bb0f22 | RUNNING | 10.0.4.71:1938 | 0.7 | 2048 | 4 | 8 | 1 | 32 | 0.0001 | 8192 | 0.95 | 500 | 0.025 | 18 | 3566.91 | 0.489641 |
| trial_3787a9c4 | RUNNING | 10.0.6.5:17263 | 0.9 | 4096 | 7 | 24 | 1.5 | 64 | 0 | 256 | 0.9 | 10000 | 0.01 | | | |
| trial_39a0ad6e | RUNNING | 10.0.4.38:8657 | 0.8 | 256 | 3 | 16 | 1.2 | 64 | 0.001 | 4096 | 0.95 | 2000 | 0.005 | 4 | 1268.2 | 0.50659 |
| trial_05396980 | TERMINATED | 10.0.2.129:2985 | 0.8 | 256 | 9 | 128 | 1 | 32 | 0 | 2048 | 0.95 | 10000 | 0.005 | 1 | 913.295 | 0.53046 |
| trial_059befa6 | TERMINATED | 10.0.3.245:282 | 0.98 | 1024 | 3 | 8 | 1 | 8 | 1e-06 | 1024 | 0.8 | 500 | 0.005 | 1 | 316.455 | 0.573849 |
| trial_c433a60c | TERMINATED | 10.0.3.245:281 | 0.8 | 1024 | 7 | 24 | 2 | 8 | 0.001 | 256 | 0.95 | 20000 | 0.01 | 1 | 1450.99 | 0.568653 |
| trial_277d1a8a | TERMINATED | 10.0.4.38:8658 | 0.9 | 256 | 5 | 64 | 1.5 | 64 | 0.0001 | 512 | 0.95 | 20000 | 0.005 | 1 | 861.914 | 0.56506 |
| trial_26f6b0b0 | TERMINATED | 10.0.2.129:3079 | 0.6 | 256 | 3 | 16 | 1.2 | 16 | 0.01 | 1024 | 0.9 | 8000 | 0.005 | 1 | 457.482 | 0.56582 |
| trial_2acddc5e | TERMINATED | 10.0.3.245:504 | 0.6 | 512 | 5 | 32 | 2 | 8 | 0 | 2048 | 0.95 | 10000 | 0.025 | 1 | 447.483 | 0.594953 |
+----------------+------------+-----------------+------------------------+--------------------------+----------------------+------------------------+------------------------------+-----------------+---------------------+-----------------------+-----------------------+------------------------+--------------------------+--------+------------------+----------------+
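For reference, a minimal sketch of the kind of Tune setup described above, written against the Ray 1.9-era API. The trainable, metric name, and search space below are hypothetical placeholders, not the actual Ludwig AutoML code (the real reproduction script is linked below).

```python
# Minimal sketch (hypothetical trainable and search space) of the kind of job
# described above: up to 10 trials, async hyperband scheduling, at most 3
# concurrent trials, and 1 GPU + 4 CPUs per trial. Uses HyperOptSearch only as
# an example searcher (requires the `hyperopt` package).
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.hyperopt import HyperOptSearch


def train_higgs(config):
    # Placeholder trainable; the real job trains a model and reports its score.
    tune.report(metric_score=0.5)


ray.init(address="auto")  # connect to the running cluster

tune.run(
    train_higgs,
    metric="metric_score",
    mode="min",                          # direction is a placeholder here
    num_samples=10,                      # up to 10 trials
    time_budget_s=3600,                  # 1-hour wall-clock limit
    scheduler=AsyncHyperBandScheduler(),
    search_alg=ConcurrencyLimiter(HyperOptSearch(), max_concurrent=3),
    resources_per_trial={"cpu": 4, "gpu": 1},
    config={
        "training.learning_rate": tune.loguniform(0.001, 0.1),
    },
)
```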
Versions / Dependencies
Ray 1.9.1
Reproduction script
https://github.com/ludwig-ai/experiments/blob/main/automl/validation/run_nodeless.sh run with Ray deployed on a K8s cluster. Can provide the Ray deployment script if desired.
Anything else
This problem is highly reproducible for me.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- State: closed
- Created 2 years ago
- Comments: 67 (29 by maintainers)
Hmm do we know what created those objects and what is referencing them? “ray memory” can show you more information on this.
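For anyone debugging a similar case, a minimal, hypothetical sketch of the kind of reference that “ray memory” reports on; the task and payload below are illustrative, not taken from the actual Tune job.

```python
# Hypothetical illustration of an object reference that `ray memory` reports on.
# While `ref` is alive, copies of the object stay in the cluster's object
# stores, and `ray memory` (run from a shell on the head node) lists the
# reference with the process holding it, the reference type, the object size,
# and the call site that created it.
import numpy as np
import ray

ray.init(address="auto")  # connect to the running cluster


@ray.remote
def produce_dataset():
    # The return value lands in the object store of the node that ran the task.
    return np.zeros((10_000, 1_000))


ref = produce_dataset.remote()
data = ray.get(ref)  # fetching pulls a copy into the caller's node as well

# Dropping the last references allows the copies to be evicted/garbage-collected.
del ref, data
```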
Yes, thank you @mwtian and @iycheng, that worked! At least it worked fine in my single-worker-node repro scenario above, so hopefully it will work in general.
I was able to do the “pip install” for the workers using the “setupCommands:”, and I was able to do the “pip install” for the head here:
I vote for this approach.
Still looking at this. Will have a fix soon.
My experimental run last night with this change looked great, with the Ray Autoscaler scaling down as desired. Thank you very much!!
I did not restart the Ray Cluster after upgrading. But “ray --version” showed the expected version, 2.0.0.dev0, after upgrading on both the head and the worker.
In a k8s deployment, the autoscaler is running in a separate operator node; that node shows that it is running version 2.0.0.dev0 as well.
I can run the pip install in the setupCommands for the workers, but unfortunately Ray does not support specifying setupCommands for the head when deploying onto K8s. What I think I can do instead is bring up the Ray head (with 0 workers), update the head, and then restart it, if you think that may work.
I’d recommend, if possible, running head and worker images with the changes built in rather than running pip install at runtime or in setup commands.