ray: [Nightly] train_small timeout. Stuck in `Waiting until actors are ready` for more than 60seconds
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Others
What happened + What you expected to happen
train_small is timing out. Most notably waiting untill actors are ready takes more than 60s. This is after wait_cluster is done so no more autoscaling should be happening.
Happens both in the case using ray client and not.
[INFO 2021-10-10 11:23:56,465] e2e.py: 825 Starting job train_small_1633889125 with Ray address: anyscale://train_small_1633889125?autosuspend=10
--
| (pid=1279) 2021-10-10 11:24:23,274 INFO main.py:913 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
| The actor or task with ID ffffffffffffffff625c288d32dc2a41fdedf2a402000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install. Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
| Required resources for this actor or task: {CPU_group_cda3ad2861590c44996ec6a82e0b2cfc: 4.000000}
| Available resources on this node: {0.000000/4.000000 CPU, 587202559.960938 GiB/587202559.960938 GiB memory, 233156159.960938 GiB/233156159.960938 GiB object_store_memory, 0.000000/4.000000 CPU_group_cda3ad2861590c44996ec6a82e0b2cfc, 1.000000/1.000000 node:172.31.120.249, 1000.000000/1000.000000 bundle_group_2_cda3ad2861590c44996ec6a82e0b2cfc, 1000.000000/1000.000000 bundle_group_cda3ad2861590c44996ec6a82e0b2cfc, 4.000000/4.000000 CPU_group_2_cda3ad2861590c44996ec6a82e0b2cfc}
| In total there are 0 pending tasks and 1 pending actors on this node.
| (pid=1279) 2021-10-10 11:24:53,302 INFO main.py:947 -- Waiting until actors are ready (30 seconds passed).
| (pid=1279) 2021-10-10 11:25:23,375 INFO main.py:947 -- Waiting until actors are ready (60 seconds passed).
| (pid=1279) 2021-10-10 11:25:36,855 INFO main.py:958 -- [RayXGBoost] Starting XGBoost training.
Link: https://buildkite.com/ray-project/periodic-ci/builds/1209#49028e00-affd-404f-b2b3-7309a875787b
Reproduction script
Nightly release test
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 36 (35 by maintainers)
This sounds suspiciously like this P0 issue here which afaik is long outstanding as well: https://github.com/ray-project/ray/issues/19326
I think it could be a duplicate (@scv119 maybe this is another repro?)
yup. it is slow not stuck.
Yes, that’s correct! Thank you @rkooo567! This test is now failing every day now.
Just had a run with identical scheduling issue, but using use_connect=False. train_small logs:
I suspect this might just be a general autoscaler/scheduler issue, and not client specific.