ray: [Bug] autoscaler doesn't shut down workers after calling request_resources

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

What happens: After calling request_resources, an appropriate number of worker nodes are spun up (expected). However, they never scale down, even when they are unused.

I observe the following in monitor.log even after the cluster has reached 3k CPUs:

Demands:
 {'CPU': 1}: 3000+ from request_resources()

Expectation: Unused workers are shut down.

The old behaviour (ray 1.0.0 and possibly later) used to clear the resource request when the running resources exceeded the request amount, so that the request only lasts until the cluster has scaled up.
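To make the old behaviour concrete, here is a minimal sketch of the "clear the request once it is satisfied" logic. It is not Ray's code: `RequestTracker`, `request`, and `update` are hypothetical names, and the `>=` comparison is an assumption modelled on the `if len(nodes) >= target_workers` check mentioned under "Anything else".

```python
from dataclasses import dataclass


@dataclass
class RequestTracker:
    """Hypothetical stand-in for the autoscaler's record of request_resources()."""

    requested_cpus: int = 0

    def request(self, num_cpus: int) -> None:
        # A new request replaces whatever was stored before.
        self.requested_cpus = num_cpus

    def update(self, running_cpus: int) -> int:
        """Return the unmet CPU demand for one autoscaler tick.

        Once the running CPUs cover the request, the request is dropped,
        so it no longer pins idle workers on later ticks.
        """
        if self.requested_cpus and running_cpus >= self.requested_cpus:
            self.requested_cpus = 0
        return max(self.requested_cpus - running_cpus, 0)
```

With this logic, a request for 3000 CPUs reports unmet demand while the cluster is still growing and is forgotten on the first tick where 3000 CPUs are running. After that, the normal idle timeout is free to remove unused workers.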

Versions / Dependencies

ray==1.6.0

Reproduction script

Boot a cluster and call request_resources(n), where n is greater than the head node's resources. Wait for the workers to boot, then observe that they are never shut down, even after the idle timeout elapses.
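The steps above could be scripted roughly as follows. This is a sketch rather than the exact script used for the report: `reproduce`, `count_alive`, `watch_s`, and `poll_s` are hypothetical names, and the request is passed with the `num_cpus` keyword of `ray.autoscaler.sdk.request_resources`. It assumes it is run on the head node of an already running autoscaling cluster.

```python
import time
from typing import Any, Dict, List


def count_alive(nodes: List[Dict[str, Any]]) -> int:
    """Count entries from ray.nodes() that are marked alive."""
    return sum(1 for node in nodes if node.get("Alive"))


def reproduce(num_cpus: int, watch_s: float, poll_s: float = 30.0) -> None:
    """Request CPUs, then log the cluster size for watch_s seconds."""
    # Imported lazily so this file can be loaded on a machine without Ray.
    import ray
    from ray.autoscaler.sdk import request_resources

    # Attach to the running cluster instead of starting a local one.
    ray.init(address="auto")

    # Ask the autoscaler for more CPUs than the head node provides.
    request_resources(num_cpus=num_cpus)

    # Keep polling well past the idle timeout; no tasks are submitted,
    # so every worker that boots is idle the whole time.
    deadline = time.monotonic() + watch_s
    while time.monotonic() < deadline:
        cpus = ray.cluster_resources().get("CPU", 0)
        print(f"alive nodes: {count_alive(ray.nodes())}, CPUs: {cpus}")
        time.sleep(poll_s)
```

Calling, for example, `reproduce(num_cpus=3000, watch_s=3600)` with `watch_s` set longer than the configured idle timeout should show the node count rising and then staying flat if the bug is present.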

Anything else

I believe this functionality was removed as part of this PR: https://github.com/ray-project/ray/pull/11802/files#diff-32ec34dc41fb43d2614e8bc5a857aac8f9bdca6a985df52fc7f207d62e5cb162L161-L163 (search for if len(nodes) >= target_workers in autoscaler.py)

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 29 (29 by maintainers)

Most upvoted comments

hmmm can you do autoscaling_mode: inf? We should probably just support that instead of attempting to pick some arbitrary high number (though you could also just pick some higher number)