ray: [autoscaler] Too many workers are scaled up in Kubernetes

What is the problem?

  • ray = 1.0.0
  • autoscaling on k8s

When autoscaling on Kubernetes, it sometimes happens (in roughly 50% of my attempts) that 2 workers are scaled up instead of 1.

Reproduction (REQUIRED)

  • Have a Ray autoscaling cluster on Kubernetes
  • Deploy an actor that requires resources from a new worker (see the sketch below)
  • The Ray autoscaler starts an extra worker
  • When the first worker is (almost) ready, the autoscaler triggers yet another worker to start. This seems to happen because, for some reason, NumNodesConnected becomes equal to NumNodesUsed:
 - NumNodesConnected: 3
 - NumNodesUsed: 3

A few seconds later (while the extra node is already being started and the actor is deployed) I see:

 - NumNodesConnected: 3
 - NumNodesUsed: 1.2

I’ve extracted the relevant part of the monitor.err logs: monitor.err.txt

It seems like there is a race condition that causes the autoscaler to trigger an extra worker to start while the actor is still being deployed…
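
For reference, here is a minimal sketch of the kind of driver script that triggers the scale-up for me. The actor name and the num_cpus value are placeholders; the only thing that matters is that the actor’s resource request cannot be satisfied by the nodes that are already connected:

    import time

    import ray

    # Connect to the existing autoscaling cluster (e.g. from inside the head pod).
    ray.init(address="auto")

    # Placeholder actor: its CPU request cannot be satisfied by the nodes that
    # are already up, so the autoscaler has to start one (and only one) extra
    # worker pod.
    @ray.remote(num_cpus=1)
    class Dummy:
        def ping(self):
            return "ok"

    actor = Dummy.remote()
    print(ray.get(actor.ping.remote()))  # blocks until a new worker pod is ready

    # Keep the actor alive for a while so the behaviour of the extra, unneeded
    # worker can be observed in monitor.err.
    time.sleep(300)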

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

BR, Pieterjan cc @edoakes

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (13 by maintainers)

Most upvoted comments

Cc @AmeerHajAli @ericl

I’ll look into this, but feel free to steal it if you have the bandwidth.