ray: [core/tune] Placement groups stuck in pending mode
What happened + What you expected to happen
This is on an Anyscale cluster with Ray 1.12.0rc0 (currently trying to repro on 1.12.0 proper).
I’m in a session (on Anyscale) where a Ray Tune job is blocked waiting for placement groups.
Ray Tune scheduled enough placement groups:
cnt = Counter([pg["state"] for pg in ray.util.placement_group_table().values()])
cnt
Counter({'REMOVED': 202, 'PENDING': 100})
Each placement group requests one CPU bundle:
{'placement_group_id': '9db0e3f4efaa51e55509ccd3bf137a54', 'name': '__tune_11c2264d__39702790', 'bundles': {0: {'CPU': 1.0}}, 'strategy': 'PACK', 'state': 'REMOVED', 'stats': {'end_to_end_creation_latency_ms': 258.807, 'scheduling_latency_ms': 2.174, 'scheduling_attempt': 3, 'highest_retry_delay_ms': 150.0, 'scheduling_state': 'REMOVED'}}
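To see how much CPU each state accounts for, rather than just how many groups are in it, the table can be aggregated by state. This is a minimal sketch that assumes the table has the shape shown above (a dict of entries with `bundles` and `state` keys). The helper name `cpu_by_state` and the sample IDs are hypothetical, and the sample is hand-written so the snippet runs without a cluster.

```python
from collections import defaultdict


def cpu_by_state(table):
    """Sum the CPU requested across all bundles, grouped by placement group state.

    `table` is expected to look like the output of
    ray.util.placement_group_table(): {pg_id: {"bundles": {...}, "state": ...}}.
    """
    totals = defaultdict(float)
    for pg in table.values():
        # 'bundles' maps bundle index -> resource dict, e.g. {0: {'CPU': 1.0}}
        for resources in pg["bundles"].values():
            totals[pg["state"]] += resources.get("CPU", 0.0)
    return dict(totals)


# Hand-written sample in the shape shown above (the IDs are made up).
sample = {
    "pg-a": {"bundles": {0: {"CPU": 1.0}}, "state": "REMOVED"},
    "pg-b": {"bundles": {0: {"CPU": 1.0}}, "state": "PENDING"},
    "pg-c": {"bundles": {0: {"CPU": 1.0}}, "state": "PENDING"},
}
print(cpu_by_state(sample))
```

On a live cluster the same function could be called as `cpu_by_state(ray.util.placement_group_table())`.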
But they remain forever in PENDING state. Resources should be available:
======== Autoscaler status: 2022-04-27 08:26:52.164495 ========
Node status
Healthy:
 4 worker-node
 1 Head
Pending:
 10.0.2.51: worker-node, setting-up
 10.0.2.24: worker-node, setting-up
 10.0.2.152: worker-node, setting-up
 10.0.2.41: worker-node, setting-up
Recent failures:
 (no failures)
Resources
Usage:
 16.0/112.0 CPU (16.0 used of 16.0 reserved in placement groups)
 0.00/165.236 GiB memory
 0.00/69.790 GiB object_store_memory
Demands:
 {'CPU': 1.0} * 1 (PACK): 99+ pending placement groups
The usage is odd btw as I can’t see where 16 CPUs are used (all placement groups are either REMOVED or PENDING).
It seems like resources don’t get freed correctly.
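The mismatch can be made explicit by comparing the autoscaler's reported reservation with the CPU held by placement groups that are actually live. This is only a rough sketch: it assumes live groups carry the state `CREATED` (a state name that does not appear in the output above), and the helper name `unexplained_reserved_cpu` is hypothetical. The sample mirrors the counts from this report.

```python
def unexplained_reserved_cpu(table, reported_reserved):
    """Return the reserved CPU that no live placement group accounts for.

    `table` has the shape of ray.util.placement_group_table() output.
    `reported_reserved` is the "reserved in placement groups" figure
    read off the autoscaler status.
    """
    # Assumption: only groups in state 'CREATED' actually hold resources.
    created_cpu = sum(
        resources.get("CPU", 0.0)
        for pg in table.values()
        if pg["state"] == "CREATED"
        for resources in pg["bundles"].values()
    )
    return reported_reserved - created_cpu


# Mirror the report: 202 REMOVED and 100 PENDING groups, one CPU bundle each.
table = {}
for i in range(202):
    table[f"removed-{i}"] = {"bundles": {0: {"CPU": 1.0}}, "state": "REMOVED"}
for i in range(100):
    table[f"pending-{i}"] = {"bundles": {0: {"CPU": 1.0}}, "state": "PENDING"}

print(unexplained_reserved_cpu(table, reported_reserved=16.0))
```

A non-zero result means the reservation is not backed by any group the table still lists as live, which is the symptom described here.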
This is an autoscaling cluster.
Might be related to https://github.com/ray-project/ray/issues/19143?
Versions / Dependencies
Ray 1.12.0rc0
Using production jobs.
Autoscaling cluster
Head node type: m5.2xlarge
Worker node 0: c4.4xlarge (min: 3, max: 10)
Reproduction script
Ask me on Slack for a repro job.
Issue Severity
High: It blocks me from completing my task.
About this issue
- State: closed
- Created 2 years ago
- Comments: 16 (12 by maintainers)
Commits related to this issue
- [tune] Fix `has_resources_for_trial`, leading to trials stuck in PENDING mode (#24878) Tune resource bookkeeping was broken. Specifically, this is what happened in the repro provided in #24259: - ... — committed to ray-project/ray by krfricke 2 years ago
- [tune] Fix `has_resources_for_trial`, leading to trials stuck in PENDING mode (#24878) Tune resource bookkeeping was broken. Specifically, this is what happened in the repro provided in #24259: - ... — committed to krfricke/ray by krfricke 2 years ago
- [tune] Fix `has_resources_for_trial`, leading to trials stuck in PENDING mode (#24878) (#24933) Tune resource bookkeeping was broken. Specifically, this is what happened in the repro provided in #242... — committed to ray-project/ray by krfricke 2 years ago
A fix is in #24878 for which the repro passes for me.
I’ve found the culprit, it’s a faulty `has_resources_for_trial` implementation in the trial executor. Working on a fix now.
@Jet132 can you please share which Ray version this was on? 1.12.0? Just want to make sure we try out the right things when we repro.