k3s: K3s keeps crashing every few minutes due to leader election failure
Environmental Info:
K3s Version:
k3s version v1.18.9+k3s1 (630bebf9); k3os version v0.11.1 (this also occurred on v0.10.0)
Node(s) CPU architecture, OS, and Version:
Linux k3s-1 4.19.97-v8+ #1294 SMP PREEMPT Thu Jan 30 13:27:08 GMT 2020 aarch64 GNU/Linux
2× Raspberry Pi 4s running on Class 10 UHS-I SD cards
Cluster Configuration:
1 master, 1 agent
Describe the bug:
After a new install of k3os/k3s, the k3s server keeps crashing every few minutes.
Steps To Reproduce:
- Installed K3s on the server with the following options: server --cluster-init --disable=traefik,servicelb (a roughly equivalent install invocation is sketched after this list)
- Added the agent to the existing cluster with the option: agent
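For reference, a roughly equivalent install on a plain Linux host would use the upstream k3s install script as below. On k3os the same flags are normally supplied through the k3os configuration instead, so treat this as a sketch rather than the exact commands used here; the server address and token are placeholders.

```sh
# Server node: embedded etcd (--cluster-init), with Traefik and the service load balancer disabled.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --cluster-init --disable=traefik,servicelb" sh -

# Agent node: join the existing cluster (placeholder address and token).
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -s - agent
```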
Expected behavior:
I would expect Kubernetes to run stably.
Actual behavior:
Kubernetes crashes every couple of minutes after startup. Pods that rely on leader election also seem to crash.
Additional context / logs:
Logs
Trace[148579825]: [661.953235ms] [661.523057ms] Object stored in database
I1202 20:45:01.040074 8976 trace.go:116] Trace[1934252170]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-12-02 20:44:59.468270221 +0000 UTC m=+81.601702519) (total time: 1.571678379s):
Trace[1934252170]: [1.571534493s] [1.569092018s] Transaction committed
I1202 20:45:01.040962 8976 trace.go:116] Trace[158077054]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler,user-agent:k3s/v1.18.9+k3s1 (linux/arm64) kubernetes/630bebf/leader-election,client:127.0.0.1 (started: 2020-12-02 20:44:59.46773849 +0000 UTC m=+81.601170788) (total time: 1.573054522s):
Trace[158077054]: [1.572577086s] [1.572250481s] Object stored in database
I1202 20:45:04.703977 8976 trace.go:116] Trace[1309380047]: "List etcd3" key:/events,resourceVersion:,limit:1000,continue: (started: 2020-12-02 20:45:03.793147584 +0000 UTC m=+85.926579882) (total time: 910.53336ms):
Trace[1309380047]: [910.53336ms] [910.53336ms] END
I1202 20:45:04.968218 8976 trace.go:116] Trace[560134109]: "List" url:/api/v1/events,user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Lens/3.6.7 Chrome/83.0.4103.122 Electron/9.1.2 Safari/537.36,client:127.0.0.1 (started: 2020-12-02 20:45:03.792894978 +0000 UTC m=+85.926327275) (total time: 1.174766967s):
Trace[560134109]: [912.909671ms] [912.674175ms] Listing from storage done
Trace[560134109]: [1.174751301s] [261.84163ms] Writing http response done count:517
E1202 20:45:10.124876 8976 controller.go:178] failed to update node lease, error: Put https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/k3s-1?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
time="2020-12-02T20:45:10.130410420Z" level=error msg="error in txn: context canceled"
I1202 20:45:10.125073 8976 trace.go:116] Trace[1690869822]: "GuaranteedUpdate etcd3" type:*coordination.Lease (started: 2020-12-02 20:45:00.129753967 +0000 UTC m=+82.263186265) (total time: 9.995187714s):
Trace[1690869822]: [9.995187714s] [9.992894569s] END
I1202 20:45:10.131045 8976 trace.go:116] Trace[2116949679]: "Update" url:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/k3s-1,user-agent:k3s/v1.18.9+k3s1 (linux/arm64) kubernetes/630bebf,client:127.0.0.1 (started: 2020-12-02 20:45:00.129205625 +0000 UTC m=+82.262637923) (total time: 10.001699452s):
Trace[2116949679]: [10.001699452s] [10.001346384s] END
I1202 20:45:11.194877 8976 trace.go:116] Trace[1638256539]: "GuaranteedUpdate etcd3" type:*core.Endpoints (started: 2020-12-02 20:45:01.181796158 +0000 UTC m=+83.315228456) (total time: 10.01293533s):
Trace[1638256539]: [9.162288138s] [9.160022085s] Transaction committed
Trace[1638256539]: [10.01293533s] [846.865017ms] END
E1202 20:45:13.063493 8976 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I1202 20:45:13.065564 8976 trace.go:116] Trace[1118593115]: "Update" url:/api/v1/namespaces/kube-system/endpoints/rancher.io-local-path,user-agent:local-path-provisioner/v0.0.0 (linux/arm64) kubernetes/$Format,client:10.42.0.5 (started: 2020-12-02 20:45:01.181165632 +0000 UTC m=+83.314597930) (total time: 11.884231395s):
Trace[1118593115]: [11.884231395s] [11.883804717s] END
E1202 20:45:11.696999 8976 leaderelection.go:356] Failed to update lock: Put https://127.0.0.1:6444/api/v1/namespaces/kube-system/endpoints/cloud-controller-manager: context deadline exceeded
I1202 20:45:11.697467 8976 trace.go:116] Trace[463620426]: "GuaranteedUpdate etcd3" type:*core.Endpoints (started: 2020-12-02 20:45:01.74087159 +0000 UTC m=+83.874303888) (total time: 9.956464373s):
Trace[463620426]: [9.039221883s] [9.036212381s] Transaction committed
Trace[463620426]: [9.956464373s] [915.207378ms] END
E1202 20:45:11.953410 8976 leaderelection.go:356] Failed to update lock: Put https://127.0.0.1:6444/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s: context deadline exceeded
I1202 20:45:11.954804 8976 trace.go:116] Trace[22746264]: "GuaranteedUpdate etcd3" type:*core.Endpoints (started: 2020-12-02 20:45:01.99803945 +0000 UTC m=+84.131471747) (total time: 9.956183886s):
Trace[22746264]: [9.003243307s] [9.00081496s] Transaction committed
Trace[22746264]: [9.956183886s] [950.672304ms] END
I1202 20:45:12.000414 8976 trace.go:116] Trace[683943658]: "Update" url:/api/v1/namespaces/kube-system/endpoints/kube-controller-manager,user-agent:k3s/v1.18.9+k3s1 (linux/arm64) kubernetes/630bebf/leader-election,client:127.0.0.1 (started: 2020-12-02 20:45:01.997261574 +0000 UTC m=+84.130693872) (total time: 10.002829094s):
Trace[683943658]: [10.002829094s] [10.002248863s] END
E1202 20:45:13.067344 8976 leaderelection.go:356] Failed to update lock: Put https://127.0.0.1:6444/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: context deadline exceeded
I1202 20:45:13.067959 8976 trace.go:116] Trace[924601370]: "GuaranteedUpdate etcd3" type:*core.Endpoints (started: 2020-12-02 20:45:03.431252866 +0000 UTC m=+85.564685256) (total time: 9.636551546s):
Trace[924601370]: [7.795962966s] [7.783669052s] Transaction committed
Trace[924601370]: [9.636551546s] [1.838588614s] END
I1202 20:45:13.078654 8976 leaderelection.go:277] failed to renew lease kube-system/cloud-controller-manager: timed out waiting for the condition
E1202 20:45:13.082806 8976 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I1202 20:45:13.083306 8976 leaderelection.go:277] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
I1202 20:45:13.088083 8976 event.go:278] Event(v1.ObjectReference{Kind:"Lease", Namespace:"kube-system", Name:"kube-controller-manager", UID:"2b095a6b-cf7c-4767-bbb2-ca5d69073af6", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"6053", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' k3s-1_4f23b3d9-b58d-4c79-84bd-854a7d7c6afa stopped leading
I1202 20:45:13.088647 8976 leaderelection.go:277] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
I1202 20:45:13.091349 8976 event.go:278] Event(v1.ObjectReference{Kind:"Lease", Namespace:"kube-system", Name:"cloud-controller-manager", UID:"d10eee0b-8e31-4b7d-92ed-0035622e0c74", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"6052", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' k3s-1_b78da5f0-164b-4354-b477-803bd79ff913 stopped leading
E1202 20:45:13.094074 8976 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
F1202 20:45:13.104741 8976 controllermanager.go:213] leaderelection lost
I1202 20:45:13.108478 8976 trace.go:116] Trace[955622096]: "Update" url:/api/v1/namespaces/kube-system/endpoints/cloud-controller-manager,user-agent:k3s/v1.18.9+k3s1 (linux/arm64) kubernetes/630bebf/leader-election,client:127.0.0.1 (started: 2020-12-02 20:45:01.740110714 +0000 UTC m=+83.873542975) (total time: 11.368219885s):
Trace[955622096]: [11.368219885s] [11.36765371s] END
F1202 20:45:13.109628 8976 controllermanager.go:279] leaderelection lost
I1202 20:45:13.113429 8976 endpointslice_controller.go:229] Shutting down endpoint slice controller
I1202 20:45:13.114256 8976 event.go:278] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' k3s-1_4f23b3d9-b58d-4c79-84bd-854a7d7c6afa stopped leading
F1202 20:45:13.114557 8976 server.go:244] leaderelection lost
I1202 20:45:13.114653 8976 event.go:278] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' k3s-1_b78da5f0-164b-4354-b477-803bd79ff913 stopped leading
I1202 20:45:13.115929 8976 trace.go:116] Trace[1502416122]: "Update" url:/api/v1/namespaces/kube-system/endpoints/kube-scheduler,user-agent:k3s/v1.18.9+k3s1 (linux/arm64) kubernetes/630bebf/leader-election,client:127.0.0.1 (started: 2020-12-02 20:45:03.429242772 +0000 UTC m=+85.562675310) (total time: 9.686487495s):
Trace[1502416122]: [9.686487495s] [9.684921448s] END
I1202 20:45:13.930719 8976 ttl_controller.go:130] Shutting down TTL controller
I1202 20:45:13.930902 8976 deployment_controller.go:165] Shutting down deployment controller
I1202 20:45:13.930962 8976 pvc_protection_controller.go:113] Shutting down PVC protection controller
I1202 20:45:13.930994 8976 daemon_controller.go:300] Shutting down daemon sets controller
I1202 20:45:13.931013 8976 expand_controller.go:331] Shutting down expand controller
I1202 20:45:13.931113 8976 pv_controller_base.go:311] Shutting down persistent volume controller
I1202 20:45:13.931177 8976 replica_set.go:194] Shutting down replicaset controller
I1202 20:45:13.931187 8976 gc_controller.go:100] Shutting down GC controller
I1202 20:45:13.931333 8976 node_lifecycle_controller.go:593] Shutting down node controller
I1202 20:45:13.931394 8976 job_controller.go:157] Shutting down job controller
I1202 20:45:13.931514 8976 clusterroleaggregation_controller.go:161] Shutting
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 23 (9 by maintainers)
Your SD card is unable to support both workload and control-plane I/O. It is overloaded to the point where datastore writes are taking more than 10 seconds; after 10 seconds, many internal Kubernetes components exceed a hardcoded timeout and fatally terminate the process.
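One way to confirm this is to measure fsync-bound write latency under /var/lib/rancher, which is what the embedded datastore is sensitive to. A minimal sketch using fio, assuming fio is available on the node or run from a debug container (the job parameters mirror the commonly used etcd disk check):

```sh
# Small sequential writes with an fdatasync after each one, under the k3s data directory.
# If the reported sync latencies climb toward the 10s component timeouts, the card is the bottleneck.
sudo fio --name=datastore-latency --directory=/var/lib/rancher \
    --size=64m --bs=2300 --rw=write --ioengine=sync --fdatasync=1 \
    --runtime=30s --time_based
```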
You might try adding some external storage for /var/lib/rancher; a USB-attached SSD is your best bet.
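A minimal sketch of what that migration could look like, assuming the SSD shows up as /dev/sda and that the k3s service can be stopped on the node; on k3os the persistent mount would normally be declared through the k3os configuration rather than /etc/fstab:

```sh
sudo systemctl stop k3s                  # k3os may manage the service differently
sudo mkfs.ext4 /dev/sda1                 # wipes the SSD
sudo mount /dev/sda1 /mnt
sudo rsync -a /var/lib/rancher/ /mnt/    # copy the existing datastore, images, and manifests
sudo umount /mnt
sudo mount /dev/sda1 /var/lib/rancher    # remount the SSD over the k3s data directory
sudo systemctl start k3s
```

Make the mount persistent (fstab on a stock distro, or the k3os config on k3os) so it survives reboots.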