autoscaler: cluster-autoscaler: crashes when k8s API is updated

We are using AWS EKS, and when AWS periodically updates the EKS service, we see the cluster-autoscaler pod crash. For example, last week the cluster was updated from v1.13.11 to v1.13.12 and this caused the pod to crash. Here’s the last state of the pod:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 19 Nov 2019 02:03:57 +0100
      Finished:     Tue, 19 Nov 2019 02:04:27 +0100

There’s nothing really interesting in the logs at this time, just this:

    I1119 01:03:57.820185       1 main.go:333] Cluster Autoscaler 1.13.1
    F1119 01:04:27.821536       1 main.go:355] Failed to get nodes from apiserver: Get https://172.20.0.1:443/api/v1/nodes: dial tcp 172.20.0.1:443: i/o timeout

The metrics-server also crashed at the same time, so perhaps it’s an issue in one of the Go dependencies?
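
For context, the "F" prefix on that last log line is a klog/glog Fatalf, which exits the process with status 255, matching the Exit Code above. Below is a minimal sketch of the kind of single-attempt call that produces this behaviour; it assumes client-go and klog and is not the actual cluster-autoscaler code (the List signature also varies by client-go version):

    // A minimal single-attempt sketch, not the actual cluster-autoscaler code.
    // Assumes client-go and klog; the context-free List signature matches the
    // client-go generation used around the 1.13 era.
    package main

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/klog"
    )

    func main() {
        config, err := rest.InClusterConfig()
        if err != nil {
            klog.Fatalf("Failed to build in-cluster config: %v", err)
        }
        client := kubernetes.NewForConfigOrDie(config)

        // One attempt only: if the apiserver is unreachable (for example while the
        // EKS control plane is being replaced), the dial times out, Fatalf writes
        // the "F..." log line and calls os.Exit(255) -- the Exit Code 255 above.
        nodes, err := client.CoreV1().Nodes().List(metav1.ListOptions{})
        if err != nil {
            klog.Fatalf("Failed to get nodes from apiserver: %v", err)
        }
        klog.Infof("Got %d nodes", len(nodes.Items))
    }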

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 23 (5 by maintainers)

Most upvoted comments

Hm, an EKS rolling upgrade terminates the master nodes. The load balancer times out if in-flight requests are not finished, and in some cases a master node is not removed from the load balancer in time, leaving a dead backend. My teammate is working on making the upgrade smoother.

Yes, but it could perhaps retry in a loop for a while before exiting with an error?
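
For illustration, a bounded retry around the initial node fetch could look roughly like the sketch below. It assumes client-go’s wait.PollImmediate helper; the function name getNodesWithRetry and the durations are made up for the example, and this is not the actual cluster-autoscaler code:

    // Hypothetical sketch of retrying the initial node fetch instead of exiting
    // on the first failure; getNodesWithRetry and the durations are illustrative.
    package main

    import (
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/klog"
    )

    // getNodesWithRetry polls the apiserver until it answers or the timeout
    // expires, so a short control-plane outage does not immediately become a
    // fatal exit.
    func getNodesWithRetry(client kubernetes.Interface, interval, timeout time.Duration) error {
        return wait.PollImmediate(interval, timeout, func() (bool, error) {
            if _, err := client.CoreV1().Nodes().List(metav1.ListOptions{}); err != nil {
                klog.Warningf("apiserver not reachable yet, retrying: %v", err)
                return false, nil // keep polling instead of aborting
            }
            return true, nil
        })
    }

    func main() {
        config, err := rest.InClusterConfig()
        if err != nil {
            klog.Fatalf("Failed to build in-cluster config: %v", err)
        }
        client := kubernetes.NewForConfigOrDie(config)

        // Give the control plane a couple of minutes to come back before giving up.
        if err := getNodesWithRetry(client, 5*time.Second, 2*time.Minute); err != nil {
            klog.Fatalf("Failed to get nodes from apiserver after retrying: %v", err)
        }
    }

Polling with a bounded timeout would ride out a short control-plane outage during an upgrade while still failing fast if the apiserver stays unreachable.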

Makes sense. Happy to accept a PR 😃

Why do you see it as a problem?

It’s definitely a problem. It’s an error, check the reason and exit code. Cluster updates are happening every month or so and nothing else crashes in this process. We have monitoring and alerts for these events.

The CA should recover from this without exiting with non-zero status IMO 🙂

The kubelet or deployment controller will be restarting the CA on a regular basis anyway.

Why? We don’t see any restarts of the pod outside of crashes and updates.