autoscaler: failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew

I am running Kubernetes 1.12.5 with etcd3 and cluster-autoscaler v1.2.2 (on AWS), and my cluster is healthy with everything operational. After some scaling activity, cluster-autoscaler goes into a crash loop with the following error:

F0205 23:32:52.241542       1 main.go:384] lost master
goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.stacks(0xc000022100, 0xc000574000, 0x37, 0xee)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:828 +0xd4
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).output(0x4333560, 0xc000000003, 0xc00056e000, 0x429c819, 0x7, 0x180, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:779 +0x306
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).printf(0x4333560, 0x3, 0x26f2036, 0xb, 0x0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:678 +0x14b
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.Fatalf(0x26f2036, 0xb, 0x0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1207 +0x67
main.main.func3()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:384 +0x47
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000668000)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:163 +0x40
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000668000, 0x29c4b00, 0xc000591dc0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:172 +0x112
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x29c4b40, 0xc000046040, 0x29cbd20, 0xc0001e6a20, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00001f030, 0x27baac0, 0x0, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:184 +0x99
main.main()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:372 +0x5cf
I0205 23:32:52.241724       1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"e78ccdca-2440-11e9-8514-0a1153ba0cc4", APIVersion:"v1", ResourceVersion:"6949892", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-57f79874cf-c45xb stopped leading
I0205 23:32:52.745013       1 auto_scaling_groups.go:124] Registering ASG XXXX

Everything in the cluster seems to work perfectly fine, and the masters, the cluster itself, and etcd are all healthy.
Is there any way to recover from or resolve this issue?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 43 (6 by maintainers)

Most upvoted comments

I had a similar issue on my cluster (using EKS):

F0802 00:10:57.242174 1 main.go:384] lost master
I0802 00:10:57.242128 1 leaderelection.go:249] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
I0802 00:10:57.244543 1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"1fc342a0-4b63-11e9-b984-02635bc9a4cc", APIVersion:"v1", ResourceVersion:"27196690", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-aws-cluster-autoscaler-59fbbcb794-7kzfv stopped leading

Then the pod died and restarted. It seems to be a hiccup, but I would like to know why it happened.

I have the same problem:

I0514 05:08:51.277989       1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F0514 05:08:51.278016       1 main.go:409] lost master

I am running autoscaler version 1.15.6.

For what it's worth, if I add the following flag, it crashes less often. I think it really cuts down on Kubernetes API calls, so there is less chance of crashing.

        - --leader-elect=false
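
For context, a minimal sketch of where that flag would sit in the cluster-autoscaler Deployment (the image tag, cloud-provider flag, and command below are illustrative placeholders, not taken from this thread):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          # Placeholder image tag; use whichever release matches your cluster version.
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.21.1
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws        # placeholder; matches the AWS/EKS setups in this thread
            - --leader-elect=false        # skip leader election entirely, so the
                                          # "failed to renew lease ... lost master" path is never hit
```

With a single replica and leader election disabled, the process never has to renew the kube-system/cluster-autoscaler lock, which is exactly the call that times out in the logs above (though see the question about rolling updates further down).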

I have also seen that most people run a single replica of the CA and forget to check the default value of leader-elect=true in the FAQ:

leader-elect: Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability. Default: true

If this is set to false, as replied by @tkbrex, the election process is disabled and we will not see this lost master error.

We’re seeing this on EKS 1.21.

Seeing weird behaviour with cluster-autoscaler; not sure what exactly is causing this. Autoscaler version: 1.21.1. Noticed a number of restarts; no resource limits/requests set for CPU.

Describing the cluster-autoscaler pod shows:

    State:          Running
      Started:      Fri, 28 Oct 2022 18:05:10 +0530
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 28 Oct 2022 17:56:37 +0530
      Finished:     Fri, 28 Oct 2022 18:02:19 +0530
    Ready:          True
    Restart Count:  36

----------------------------------
Logs: 
```
I1028 12:32:10.414618       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 0 items received
I1028 12:32:10.414628       1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:338: Watch close - *v1.Job total 8 items received
I1028 12:32:10.414642       1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:329: Watch close - *v1.ReplicationController total 0 items received
I1028 12:32:10.413723       1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188: Watch close - *v1.Pod total 9 items received
I1028 12:32:10.414657       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicaSet total 13 items received
E1028 12:32:12.445308       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 182..xs.x.x:443: connect: connection refused
E1028 12:32:15.453424       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 172.20.186.41:443: connect: connection refused
E1028 12:32:17.469406       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 172.20.186.41:443: connect: connection refused
E1028 12:32:19.457301       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 182..xs.x.x:443: connect: connection refused
I1028 12:32:19.832254       1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F1028 12:32:19.832296       1 main.go:450] lost master
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc0010267e0, 0x37, 0xd7)
```

I have also seen that most people run a single replica of the CA and forget to check the default value of leader-elect=true in the FAQ.

Is disabling leader election really recommended? All of the official examples I’m aware of specify replicas: 1 but keep the default value for leader-elect.

Even when running replicas: 1, wouldn’t leader election be necessary during rolling updates of the CA deployment? Otherwise, I would think there’d be periods where you could have multiple CA pods stepping on each other.
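
If leader election is disabled, one way to avoid that overlap (a sketch of a possible mitigation, not something recommended elsewhere in this thread) is to switch the Deployment to the Recreate update strategy, so the old pod is stopped before its replacement starts:

```yaml
# Sketch: Deployment spec fragment for a cluster-autoscaler running with
# --leader-elect=false. With the Recreate strategy the old pod is terminated
# before the new one is created, so two unelected CA pods never run at once.
spec:
  replicas: 1
  strategy:
    type: Recreate
```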

I had this issue with the autoscaler, with its CPU limit set to 100m:

E0325 00:25:02.404766       1 leaderelection.go:361] Failed to update lock: Put "https://<API>/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": context deadline exceeded
I0325 00:25:02.404822       1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0325 00:25:02.404843       1 main.go:450] lost master
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000182001, 0xc0002e01e0, 0x37, 0xed)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1021 +0xb8
...
...

Setting the limit to 1 CPU solved the issue (it needs more CPU when it starts). So in my case it was CPU throttling, which slowed down the autoscaler itself.
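
For illustration, a minimal sketch of the corresponding resources block in the cluster-autoscaler container spec; only the 1-CPU limit reflects the fix above, the other values are assumed placeholders:

```yaml
# Container resources for the cluster-autoscaler Deployment.
# Only the 1-CPU limit comes from the comment above; the request and
# memory values are illustrative placeholders.
resources:
  requests:
    cpu: 100m
    memory: 300Mi
  limits:
    cpu: "1"          # was 100m; throttling at 100m slowed down lease renewal
    memory: 600Mi
```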

I have the same issue on EKS 1.19.