serving: Reconsider crashing the Activator when the WebSocket connection to Autoscaler is not established

In what area(s)?

/area autoscale /area networking

Describe the feature

Today, the Activator /healthz probe status is based on the status of the WebSocket connection between the Activator and the Autoscaler. As a result, if the Autoscaler is not ready yet (very common) or is unreachable for any reason, the liveness probe fails and the kubelet restarts the Activator container.

This is problematic because it shows up as a container restart in the Kubernetes metrics and is flagged as suspicious by any monitoring system. It feels like this is abusing the liveness probe. It can also hurt availability, since the Autoscaler being down will bring the Activator(s) down with it.

Shouldn’t the WebSocket connection be able to retry until success? That’s the standard behavior of any other system.

Thoughts? cc @vagababov, @dgerd, @mattmoor

About this issue

State: closed
Created 4 years ago
Comments: 16 (15 by maintainers)

Most upvoted comments

I’d go with retries that are more widely spaced and more of them.

vagababov on Jan 21, 2020