autoscaler: failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew

I am running Kubernetes 1.12.5 with etcd3 and cluster-autoscaler v1.2.2 (on AWS), and my cluster is healthy with everything operational. After some scaling activity, cluster-autoscaler goes into a crash loop with the following error:

F0205 23:32:52.241542       1 main.go:384] lost master
goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.stacks(0xc000022100, 0xc000574000, 0x37, 0xee)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:828 +0xd4
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).output(0x4333560, 0xc000000003, 0xc00056e000, 0x429c819, 0x7, 0x180, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:779 +0x306
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.(*loggingT).printf(0x4333560, 0x3, 0x26f2036, 0xb, 0x0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:678 +0x14b
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog.Fatalf(0x26f2036, 0xb, 0x0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/klog.go:1207 +0x67
main.main.func3()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:384 +0x47
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run.func1(0xc000668000)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:163 +0x40
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc000668000, 0x29c4b00, 0xc000591dc0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:172 +0x112
k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.RunOrDie(0x29c4b40, 0xc000046040, 0x29cbd20, 0xc0001e6a20, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00001f030, 0x27baac0, 0x0, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:184 +0x99
main.main()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:372 +0x5cf
I0205 23:32:52.241724       1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"e78ccdca-2440-11e9-8514-0a1153ba0cc4", APIVersion:"v1", ResourceVersion:"6949892", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-57f79874cf-c45xb stopped leading
I0205 23:32:52.745013       1 auto_scaling_groups.go:124] Registering ASG XXXX

Everything in the cluster seems to work perfectly fine, and the masters, the cluster itself, and etcd are all healthy.
Is there any way to recover from or resolve this issue?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 43 (6 by maintainers)

Most upvoted comments

I had a similar issue on my cluster (using EKS):

F0802 00:10:57.242174 1 main.go:384] lost master
I0802 00:10:57.242128 1 leaderelection.go:249] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
I0802 00:10:57.244543 1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"1fc342a0-4b63-11e9-b984-02635bc9a4cc", APIVersion:"v1", ResourceVersion:"27196690", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-aws-cluster-autoscaler-59fbbcb794-7kzfv stopped leading

Then the pod died and restarted. It seems to be a hiccup, but I would like to know why it happened.

I have the same problem:

I0514 05:08:51.277989       1 leaderelection.go:281] failed to renew lease kube-system/cluster-autoscaler: failed to tryAcquireOrRenew context deadline exceeded
F0514 05:08:51.278016       1 main.go:409] lost master

I am running autoscaler version 1.15.6.

For what it's worth, if I add the following flag, it crashes less often. I think it really cuts down on Kubernetes API calls, so there is less chance of crashing.

        - --leader-elect=false
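
For context, a minimal sketch of where that flag would sit in the cluster-autoscaler Deployment (the image tag, cloud-provider flag, and command below are illustrative placeholders, not taken from this thread):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          # Placeholder image tag; use whichever release matches your cluster version.
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.21.1
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws        # placeholder; matches the AWS/EKS setups in this thread
            - --leader-elect=false        # skip leader election entirely, so the
                                          # "failed to renew lease ... lost master" path is never hit
```

With a single replica and leader election disabled, the process never has to renew the kube-system/cluster-autoscaler lock, which is exactly the call that times out in the logs above (though see the question about rolling updates further down).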

I have also seen that most people run a single replica of the CA and forget to check the default value of leader-elect=true in the FAQ:

leader-elect: Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability. Default: true

If this is set to false, as replied by @tkbrex, the election process is disabled and we will not see this lost master error.

We’re seeing this on EKS 1.21.

Seeing weird behaviour with cluster-autoscaler; not sure what exactly is causing this. Autoscaler version: 1.21.1. Noticed a number of restarts; no resource limits/requests set for CPU.

Describing the cluster-autoscaler pod shows:

    State:          Running
      Started:      Fri, 28 Oct 2022 18:05:10 +0530
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 28 Oct 2022 17:56:37 +0530
      Finished:     Fri, 28 Oct 2022 18:02:19 +0530
    Ready:          True
    Restart Count:  36

----------------------------------
Logs: 
```
I1028 12:32:10.414618       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Pod total 0 items received
I1028 12:32:10.414628       1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:338: Watch close - *v1.Job total 8 items received
I1028 12:32:10.414642       1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:329: Watch close - *v1.ReplicationController total 0 items received
I1028 12:32:10.413723       1 reflector.go:530] k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188: Watch close - *v1.Pod total 9 items received
I1028 12:32:10.414657       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicaSet total 13 items received
E1028 12:32:12.445308       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 182..xs.x.x:443: connect: connection refused
E1028 12:32:15.453424       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 172.20.186.41:443: connect: connection refused
E1028 12:32:17.469406       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 172.20.186.41:443: connect: connection refused
E1028 12:32:19.457301       1 leaderelection.go:325] error retrieving resource lock kube-system/cluster-autoscaler: Get "https://182..xs.x.x:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": dial tcp 182..xs.x.x:443: connect: connection refused
I1028 12:32:19.832254       1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F1028 12:32:19.832296       1 main.go:450] lost master
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc0010267e0, 0x37, 0xd7)
```

I have also seen that most people run a single replica of the CA and forget to check the default value of leader-elect=true in the FAQ.

Is disabling leader election really recommended? All of the official examples I’m aware of specify replicas: 1 but keep the default value for leader-elect.

Even when running replicas: 1, wouldn’t leader election be necessary during rolling updates of the CA deployment? Otherwise, I would think there’d be periods where you could have multiple CA pods stepping on each other.
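
If leader election is disabled, one way to avoid that overlap (a sketch of a possible mitigation, not something recommended elsewhere in this thread) is to switch the Deployment to the Recreate update strategy, so the old pod is stopped before its replacement starts:

```yaml
# Sketch: Deployment spec fragment for a cluster-autoscaler running with
# --leader-elect=false. With the Recreate strategy the old pod is terminated
# before the new one is created, so two unelected CA pods never run at once.
spec:
  replicas: 1
  strategy:
    type: Recreate
```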

I had this issue with the autoscaler, with its CPU limit set to 100m:

E0325 00:25:02.404766       1 leaderelection.go:361] Failed to update lock: Put "https://<API>/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler": context deadline exceeded
I0325 00:25:02.404822       1 leaderelection.go:278] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0325 00:25:02.404843       1 main.go:450] lost master
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000182001, 0xc0002e01e0, 0x37, 0xed)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1021 +0xb8
...
...

Setting the limit to 1 CPU solved the issue (it needs more CPU when it starts). So in my case it was CPU throttling, which slowed down the autoscaler itself.
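
For illustration, a minimal sketch of the corresponding resources block in the cluster-autoscaler container spec; only the 1-CPU limit reflects the fix above, the other values are assumed placeholders:

```yaml
# Container resources for the cluster-autoscaler Deployment.
# Only the 1-CPU limit comes from the comment above; the request and
# memory values are illustrative placeholders.
resources:
  requests:
    cpu: 100m
    memory: 300Mi
  limits:
    cpu: "1"          # was 100m; throttling at 100m slowed down lease renewal
    memory: 600Mi
```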

I have the same issue on EKS 1.19.