ray: [autoscaler] Too many workers are scaled up in Kubernetes

What is the problem?

  • ray = 1.0.0
  • autoscaling on k8s

When autoscaling on Kubernetes, it sometimes happens (in roughly 50% of my attempts) that 2 workers are scaled up instead of 1.

Reproduction (REQUIRED)

  • Have a Ray autoscaling cluster on Kubernetes
  • Deploy an actor that requires resources from a new worker (see the sketch below)
  • The Ray autoscaler starts an extra worker
  • When the first worker is (almost) ready, the autoscaler triggers yet another worker to start. This seems to happen because, for some reason, NumNodesConnected becomes equal to NumNodesUsed:
 - NumNodesConnected: 3
 - NumNodesUsed: 3

A few seconds later (while the extra node is already being started and the actor is deployed) I see:

 - NumNodesConnected: 3
 - NumNodesUsed: 1.2

I’ve extracted the relevant part of the monitor.err logs: monitor.err.txt

It seems like there is a race condition that causes the autoscaler to trigger an extra worker to start while the actor is still being deployed…
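
For reference, here is a minimal sketch of the kind of driver script that triggers the scale-up for me. The actor name and the num_cpus value are placeholders; the only thing that matters is that the actor’s resource request cannot be satisfied by the nodes that are already connected:

    import time

    import ray

    # Connect to the existing autoscaling cluster (e.g. from inside the head pod).
    ray.init(address="auto")

    # Placeholder actor: its CPU request cannot be satisfied by the nodes that
    # are already up, so the autoscaler has to start one (and only one) extra
    # worker pod.
    @ray.remote(num_cpus=1)
    class Dummy:
        def ping(self):
            return "ok"

    actor = Dummy.remote()
    print(ray.get(actor.ping.remote()))  # blocks until a new worker pod is ready

    # Keep the actor alive for a while so the behaviour of the extra, unneeded
    # worker can be observed in monitor.err.
    time.sleep(300)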

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

BR, Pieterjan cc @edoakes

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (13 by maintainers)

Most upvoted comments

Cc @AmeerHajAli @ericl

I’ll look into this, but feel free to steal it if you have the bandwidth.