ray: [autoscaler] Too many workers are scaled up on Kubernetes
What is the problem?
- ray = 1.0.0
- autoscaling on Kubernetes
When autoscaling on Kubernetes, it sometimes happens (in almost 50% of my attempts) that 2 workers are scaled up instead of 1.
Reproduction (REQUIRED)
- Have a Ray autoscaling cluster running on Kubernetes
- Deploy an actor that requires resources from a new worker (a minimal sketch is shown after this list)
- The Ray autoscaler starts a new worker, as expected
- When that first worker is (almost) ready, the autoscaler, for some reason, triggers a second worker to start
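A minimal sketch of the kind of actor I deploy to trigger the scale-up. It assumes the head node does not have 2 CPUs free, so the actor can only be placed on a new worker; the `num_cpus` value here is a placeholder, not my actual resource request:

```python
import ray

# Connect to the running autoscaling cluster from inside the head pod.
ray.init(address="auto")

# An actor that cannot fit on the head node, so the autoscaler should
# bring up exactly one additional worker pod to host it.
@ray.remote(num_cpus=2)
class Worker:
    def ping(self):
        return "ok"

# Deploying the actor should cause a single scale-up event.
actor = Worker.remote()
print(ray.get(actor.ping.remote()))
```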
I see this happens because, for some reason, NumNodesConnected becomes equal to NumNodesUsed:
- NumNodesConnected: 3
- NumNodesUsed: 3
A few seconds later (while the extra node is already being started and the actor is deployed) I see:
- NumNodesConnected: 3
- NumNodesUsed: 1.2
I’ve extracted the relevant part of the monitor.err logs: monitor.err.txt
It seems like there is some race condition that causes the autoscaler to trigger an extra worker to start while the actor is still being deployed.
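To make the timing visible, I watch the cluster state from the driver while the actor is being placed. This is a rough sketch using standard Ray APIs (`ray.nodes()`, `ray.cluster_resources()`, `ray.available_resources()`); the polling interval and iteration count are arbitrary:

```python
import time
import ray

ray.init(address="auto")

# Poll node membership and CPU usage so the moment the second
# (unwanted) worker joins is visible alongside the load numbers
# the autoscaler is reacting to.
for _ in range(60):
    alive = [n for n in ray.nodes() if n["Alive"]]
    print(
        f"nodes={len(alive)} "
        f"total_cpu={ray.cluster_resources().get('CPU', 0)} "
        f"available_cpu={ray.available_resources().get('CPU', 0)}"
    )
    time.sleep(5)
```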
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
BR, Pieterjan cc @edoakes
About this issue
- State: closed
- Created 4 years ago
- Comments: 16 (13 by maintainers)
Cc @AmeerHajAli @ericl
I’ll look into this, but feel free to steal it if you have the bandwidth