cluster-autoscaler: crashes when k8s API is updated
We are using AWS EKS, and when AWS periodically updates the EKS service we see the cluster-autoscaler pod crash. For example, last week the service was updated from v1.13.11 to v1.13.12 and this caused the pod to crash. Here’s the pod’s last state:
```
Last State:  Terminated
  Reason:    Error
  Exit Code: 255
  Started:   Tue, 19 Nov 2019 02:03:57 +0100
  Finished:  Tue, 19 Nov 2019 02:04:27 +0100
```
There’s nothing really interesting in the logs at this time, just this:
```
I1119 01:03:57.820185 1 main.go:333] Cluster Autoscaler 1.13.1
F1119 01:04:27.821536 1 main.go:355] Failed to get nodes from apiserver: Get https://172.20.0.1:443/api/v1/nodes: dial tcp 172.20.0.1:443: i/o timeout
```
The metrics-server also crashed at the same time, so perhaps this is an issue in one of the shared Go dependencies?
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 23 (5 by maintainers)
Hm. An EKS rolling upgrade terminates the master nodes. The load balancer will time out requests that are still in flight when that happens, and in some edge cases a terminated master is not removed from the load balancer, leaving a dead backend. My teammate is working on making the upgrade smoother.
Makes sense. Happy to accept a PR 😃
It’s definitely a problem. It’s an error — check the reason and exit code. Cluster updates happen every month or so, and nothing else crashes during the process. We have monitoring and alerts for these events.
The CA should recover from this without exiting with non-zero status IMO 🙂
Why? We don’t see any restarts of the pod outside of crashes and updates.