serving: Reconsider crashing the Activator when the WebSocket connection to Autoscaler is not established
In what area(s)?
/area autoscale /area networking
Describe the feature
Today, the Activator's /healthz probe reflects the status of the WebSocket connection between the Activator and the Autoscaler. As a result, if the Autoscaler is not ready yet (which is very common) or is unreachable for any reason, the liveness probe fails and the kubelet restarts the Activator container.
This is problematic because it shows up as a container restart in the Kubernetes metrics, which monitoring systems commonly flag as suspicious. It feels like an abuse of the liveness probe. It can also hurt availability, since an Autoscaler outage brings the Activator(s) down with it.
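To make the concern concrete, here is a minimal sketch of one way to keep the connection state out of the liveness check: liveness only reports that the process is serving, while a separate readiness endpoint reflects the Autoscaler connection. This is an illustration under assumptions, not the Activator's actual code. The `connected` flag, the handler names, and the `probe` helper are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// connected is a hypothetical flag that a WebSocket client would set when
// the Autoscaler connection is up and clear when it drops.
var connected atomic.Bool

// livenessHandler only reports that the process is able to serve requests.
// It deliberately ignores the Autoscaler connection state.
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readinessHandler reflects the Autoscaler connection state, so a missing
// connection marks the pod unready instead of triggering a restart.
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	if !connected.Load() {
		http.Error(w, "autoscaler connection not established", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// probe runs a handler against an in-memory request and returns the
// HTTP status code it produced.
func probe(h http.HandlerFunc) int {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(livenessHandler), probe(readinessHandler)) // 200 503
	connected.Store(true)
	fmt.Println(probe(livenessHandler), probe(readinessHandler)) // 200 200
}
```

With this split, an unavailable Autoscaler would show up as an unready pod rather than as a container restart.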
Shouldn’t the WebSocket connection simply retry until it succeeds? That’s the standard behavior in most other systems.
Thoughts? cc @vagababov, @dgerd, @mattmoor
About this issue
- State: closed
- Created 4 years ago
- Comments: 16 (15 by maintainers)
I’d go with more spaced and more repeats.