ray: [core/tune] Placement groups stuck in pending mode

What happened + What you expected to happen

This is on an Anyscale cluster with Ray 1.12.0rc0 (currently trying to repro on 1.12.0 proper).

I’m in a session (on Anyscale) where a Ray Tune job is blocked waiting for placement groups.

Ray Tune scheduled enough placement groups:

>>> cnt = Counter([pg["state"] for pg in ray.util.placement_group_table().values()])
>>> cnt
Counter({'REMOVED': 202, 'PENDING': 100})

Each placement group requests one CPU bundle:

{'placement_group_id': '9db0e3f4efaa51e55509ccd3bf137a54',
 'name': '__tune_11c2264d__39702790',
 'bundles': {0: {'CPU': 1.0}},
 'strategy': 'PACK',
 'state': 'REMOVED',
 'stats': {'end_to_end_creation_latency_ms': 258.807,
  'scheduling_latency_ms': 2.174,
  'scheduling_attempt': 3,
  'highest_retry_delay_ms': 150.0,
  'scheduling_state': 'REMOVED'}}
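
For reference, the same shape of placement group can be requested directly, outside of Tune. This is a minimal sketch (the address and timeout are illustrative, not from the original job); on the affected cluster the final ray.get is where things would hang in PENDING:

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")  # connect to the already-running cluster

# One bundle with a single CPU, packed onto one node -- the same shape
# Tune requests for each trial in the table above.
pg = placement_group(bundles=[{"CPU": 1.0}], strategy="PACK")

# pg.ready() returns an ObjectRef that resolves once the group is scheduled.
ray.get(pg.ready(), timeout=60)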

But they remain forever in PENDING state. Resources should be available:

======== Autoscaler status: 2022-04-27 08:26:52.164495 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker-node
 1 Head
Pending:
 10.0.2.51: worker-node, setting-up
 10.0.2.24: worker-node, setting-up
 10.0.2.152: worker-node, setting-up
 10.0.2.41: worker-node, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 16.0/112.0 CPU (16.0 used of 16.0 reserved in placement groups)
 0.00/165.236 GiB memory
 0.00/69.790 GiB object_store_memory

Demands:
 {'CPU': 1.0} * 1 (PACK): 99+ pending placement groups

The usage is odd, by the way: I can’t see where 16 CPUs are actually being used, since all placement groups are either REMOVED or PENDING.

It seems like resources don’t get freed correctly.
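
One way to cross-check the accounting is to sum the bundle reservations of every placement group that is not REMOVED and compare that against what the autoscaler reports. A small sketch using the same placement_group_table() call as above (PENDING groups have not been scheduled, so they should not reserve anything either):

from collections import Counter
import ray

ray.init(address="auto")

# Sum the bundle reservations of every placement group that is not REMOVED.
reserved = Counter()
for pg in ray.util.placement_group_table().values():
    if pg["state"] == "REMOVED":
        continue
    for bundle in pg["bundles"].values():
        reserved.update(bundle)

# With everything REMOVED or PENDING, nothing should be reserved here,
# yet the autoscaler reports 16 CPUs reserved in placement groups.
print(reserved)
print(ray.available_resources())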

This is an autoscaling cluster.

Might be related to https://github.com/ray-project/ray/issues/19143?

Versions / Dependencies

Ray 1.12.0rc0

Using production jobs.

Autoscaling cluster

Head node type: m5.2xlarge

Worker node 0: c4.4xlarge (min: 3, max: 10)

Reproduction script

Ask me on Slack for a repro job.
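
The actual repro job is only available on the Anyscale side. As a rough sketch of the workload shape described above (the trainable and sample count are placeholders, not the real job), it boils down to many short single-CPU Tune trials on an autoscaling cluster:

import time
import ray
from ray import tune

def trainable(config):
    # Placeholder workload; the real repro is an Anyscale production job.
    time.sleep(10)
    tune.report(done=True)

ray.init(address="auto")

tune.run(
    trainable,
    num_samples=300,                 # enough trials to cycle through many placement groups
    resources_per_trial={"cpu": 1},  # Tune wraps this in a 1-CPU PACK placement group
)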

Issue Severity

High: It blocks me from completing my task.


Most upvoted comments

A fix is in #24878, with which the repro passes for me.

I’ve found the culprit: a faulty has_resources_for_trial implementation in the trial executor. Working on a fix now.
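
To illustrate the failure mode for readers following along (a purely hypothetical sketch, not Ray’s actual trial executor code): if the availability check compares against stale resource bookkeeping, trials are never considered runnable and their placement groups stay pending.

# Hypothetical illustration only -- not Ray's implementation.
def has_resources_for_trial(trial_cpus: float, committed_cpus: float, total_cpus: float) -> bool:
    # If committed_cpus is never decremented when a trial's placement group
    # is removed, free_cpus stays artificially low and this check keeps
    # returning False even though the cluster has plenty of capacity.
    free_cpus = total_cpus - committed_cpus
    return trial_cpus <= free_cpus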

@Jet132 can you please share which Ray version this was on? 1.12.0? Just want to make sure we try out the right things when we repro.