ray: [Nightly] train_small timeout. Stuck in `Waiting until actors are ready` for more than 60seconds

Search before asking

I searched the issues and found no similar issues.

Ray Component

Others

What happened + What you expected to happen

train_small is timing out. Most notably waiting untill actors are ready takes more than 60s. This is after wait_cluster is done so no more autoscaling should be happening. Happens both in the case using ray client and not.

[INFO 2021-10-10 11:23:56,465] e2e.py: 825  Starting job train_small_1633889125 with Ray address: anyscale://train_small_1633889125?autosuspend=10
--
  | (pid=1279) 2021-10-10 11:24:23,274	INFO main.py:913 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
  | The actor or task with ID ffffffffffffffff625c288d32dc2a41fdedf2a402000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
  | Required resources for this actor or task: {CPU_group_cda3ad2861590c44996ec6a82e0b2cfc: 4.000000}
  | Available resources on this node: {0.000000/4.000000 CPU, 587202559.960938 GiB/587202559.960938 GiB memory, 233156159.960938 GiB/233156159.960938 GiB object_store_memory, 0.000000/4.000000 CPU_group_cda3ad2861590c44996ec6a82e0b2cfc, 1.000000/1.000000 node:172.31.120.249, 1000.000000/1000.000000 bundle_group_2_cda3ad2861590c44996ec6a82e0b2cfc, 1000.000000/1000.000000 bundle_group_cda3ad2861590c44996ec6a82e0b2cfc, 4.000000/4.000000 CPU_group_2_cda3ad2861590c44996ec6a82e0b2cfc}
  | In total there are 0 pending tasks and 1 pending actors on this node.
  | (pid=1279) 2021-10-10 11:24:53,302	INFO main.py:947 -- Waiting until actors are ready (30 seconds passed).
  | (pid=1279) 2021-10-10 11:25:23,375	INFO main.py:947 -- Waiting until actors are ready (60 seconds passed).
  | (pid=1279) 2021-10-10 11:25:36,855	INFO main.py:958 -- [RayXGBoost] Starting XGBoost training.

Link: https://buildkite.com/ray-project/periodic-ci/builds/1209#49028e00-affd-404f-b2b3-7309a875787b

Reproduction script

Nightly release test

Anything else

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

About this issue

Original URL
State: closed
Created 3 years ago
Comments: 36 (35 by maintainers)

Most upvoted comments

This sounds suspiciously like this P0 issue here which afaik is long outstanding as well: https://github.com/ray-project/ray/issues/19326

I think it could be a duplicate (@scv119 maybe this is another repro?)

ericl on Oct 20, 2021

yup. it is slow not stuck.

scv119 on Oct 26, 2021

Yes, that’s correct! Thank you @rkooo567! This test is now failing every day now.

xwjiang2010 on Oct 25, 2021

Just had a run with identical scheduling issue, but using use_connect=False. train_small logs:

2021-10-19 16:14:56,239 INFO worker.py:823 -- Connecting to existing Ray cluster at address: 172.31.29.213:6379
(pid=1263) 2021-10-19 16:15:05,607      INFO main.py:926 -- [RayXGBoost] Created 4 new actors (4 total actors). Waiting until actors are ready for training.
2021-10-19 16:15:23,937 WARNING worker.py:1228 -- The actor or task with ID ffffffffffffffff5bd80366f4075cc6aab2b9a002000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_dc13b8aff1b4c7a49f8e84c93b72b1cd: 4.000000}
Available resources on this node: {0.000000/4.000000 CPU, 587202559.960938 GiB/587202559.960938 GiB memory, 235618919.970703 GiB/235618919.970703 GiB object_store_memory, 4.000000/4.000000 CPU_group_3_dc13b8aff1b4c7a49f8e84c93b72b1cd, 1000.000000/1000.000000 bundle_group_3_dc13b8aff1b4c7a49f8e84c93b72b1cd, 0.000000/4.000000 CPU_group_dc13b8aff1b4c7a49f8e84c93b72b1cd, 1000.000000/1000.000000 bundle_group_dc13b8aff1b4c7a49f8e84c93b72b1cd, 1.000000/1.000000 node:172.31.36.181}
In total there are 0 pending tasks and 1 pending actors on this node.
(pid=1263) 2021-10-19 16:15:35,692      INFO main.py:960 -- Waiting until actors are ready (30 seconds passed).
(pid=1263) 2021-10-19 16:16:05,773      INFO main.py:960 -- Waiting until actors are ready (60 seconds passed).
(pid=1263) 2021-10-19 16:16:10,288      INFO main.py:971 -- [RayXGBoost] Starting XGBoost training.
(pid=1338) [16:16:10] task [xgboost.ray]:139862648240592 got new rank 0
(pid=170, ip=172.31.36.181) [16:16:10] task [xgboost.ray]:139816851543376 got new rank 2
(pid=164, ip=172.31.47.133) [16:16:10] task [xgboost.ray]:140690292563280 got new rank 3
(pid=164, ip=172.31.36.16) [16:16:10] task [xgboost.ray]:140326598614672 got new rank 1
Waiting for more nodes to come up: 0/1 (0 seconds passed)
Cluster is up: 4/1 nodes online after 5 seconds
PASSED.
(pid=1263) 2021-10-19 16:16:26,880      INFO main.py:1453 -- [RayXGBoost] Finished XGBoost training on training data with total N=250,000 in 83.81 seconds (16.59 pure XGBoost training time).
(train pid=1263) TRAIN TIME TAKEN: 83.82 seconds
(train pid=1263) Final training error: 0.3352

I suspect this might just be a general autoscaler/scheduler issue, and not client specific.

ckw017 on Oct 19, 2021