rancher: Rancher server crashed with error - leaderelection lost for cattle-controllers
Rancher server version - Build from master
The Rancher server crashed when cluster provisioning scenarios were attempted:
2018/05/11 21:26:28 [INFO] Handling backend connection request [m-sxsdg]
E0511 21:26:31.813321 1 streamwatcher.go:109] Unable to decode an event from the watch stream: tunnel disconnect
E0511 21:26:32.072977 1 reflector.go:315] github.com/rancher/rancher/vendor/github.com/rancher/norman/controller/generic_controller.go:129: Failed to watch *v1.Node: Get https://172.31.4.120:6443/api/v1/watch/nodes?resourceVersion=2547&timeoutSeconds=552: tunnel disconnect
E0511 21:26:32.081724 1 reflector.go:315] github.com/rancher/rancher/vendor/github.com/rancher/norman/controller/generic_controller.go:129: Failed to watch *v1.Secret: Get https://172.31.4.120:6443/api/v1/watch/secrets?resourceVersion=2101&timeoutSeconds=464: tunnel disconnect
E0511 21:26:32.414349 1 reflector.go:315] github.com/rancher/rancher/vendor/github.com/rancher/norman/controller/generic_controller.go:129: Failed to watch *v1.Secret: Get https://13.59.193.167:6443/api/v1/watch/secrets?resourceVersion=672&timeoutSeconds=525: waiting for cluster agent to connect
2018/05/11 21:26:34 [INFO] Handling backend connection request [m-ba845fb607be]
2018/05/11 21:26:34 [INFO] Handling backend connection request [m-d981f39d06e3]
2018/05/11 21:26:34 [INFO] Handling backend connection request [m-qn2rx]
2018/05/11 21:26:34 [INFO] Handling backend connection request [m-gn7lm]
2018/05/11 21:26:34 [INFO] Handling backend connection request [m-8d5a15be1dce]
2018/05/11 21:26:34 [INFO] Handling backend connection request [m-w7v62]
2018/05/11 21:26:26 [INFO] Handling backend connection request [m-f4717e1c7a42]
2018/05/11 21:26:35 [INFO] stdout: (test-17704) Waiting for IP address to be assigned to the Droplet...
2018/05/11 21:26:36 [INFO] Handling backend connection request [c-mg24m]
2018/05/11 21:26:36 [INFO] Handling backend connection request [c-9xpvw]
2018/05/11 21:26:36 [INFO] Handling backend connection request [c-xzspz]
2018/05/11 21:26:31 [INFO] Handling backend connection request [m-b9njt]
2018/05/11 21:26:39 [ERROR] netpolMgr: program: error updating network policy err=Put https://138.197.108.114:6443/apis/networking.k8s.io/v1/namespaces/default/networkpolicies/hn-nodes: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
E0511 21:26:32.443595 1 reflector.go:315] github.com/rancher/rancher/vendor/github.com/rancher/norman/controller/generic_controller.go:129: Failed to watch *v1.ClusterRole: Get https://13.59.193.167:6443/apis/rbac.authorization.k8s.io/v1/watch/clusterroles?resourceVersion=665&timeoutSeconds=406: waiting for cluster agent to connect
E0511 21:26:35.397525 1 writers.go:139] apiserver was unable to write a JSON response: http: Handler timeout
E0511 21:26:27.416041 1 event.go:260] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string]string(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' '402bb98070f9 stopped leading'
E0511 21:26:40.190817 1 writers.go:139] apiserver was unable to write a JSON response: http: Handler timeout
E0511 21:26:40.966149 1 runtime.go:66] Observed a panic: &errors.errorString{s:"kill connection/stream"} (kill connection/stream)
/go/src/github.com/rancher/rancher/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/rancher/rancher/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/rancher/rancher/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:509
/usr/local/go/src/runtime/panic.go:491
/go/src/github.com/rancher/rancher/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:230
/go/src/github.com/rancher/rancher/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:114
/go/src/github.com/rancher/rancher/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:45
/usr/local/go/src/net/http/server.go:1918
/go/src/github.com/rancher/rancher/vendor/k8s.io/apiserver/pkg/endpoints/request/requestcontext.go:110
/usr/local/go/src/net/http/server.go:1918
/go/src/github.com/rancher/rancher/vendor/k8s.io/apiserver/pkg/server/filters/wrap.go:41
/usr/local/go/src/net/http/server.go:1918
/go/src/github.com/rancher/rancher/vendor/k8s.io/apiserver/pkg/server/handler.go:198
/usr/local/go/src/net/http/server.go:2619
/usr/local/go/src/net/http/server.go:1801
/usr/local/go/src/runtime/asm_amd64.s:2337
E0511 21:26:41.232679 1 cronjob_controller.go:113] can't list Jobs: the server was unable to return a response in the time allotted, but may still be processing the request (get jobs.batch)
2018/05/11 21:26:41 [ERROR] netpolMgr: handleHostNetwork: error programming hostNetwork network policy for ns=default err=Put https://138.197.108.114:6443/apis/networking.k8s.io/v1/namespaces/default/networkpolicies/hn-nodes: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/05/11 21:26:41 [FATAL] leaderelection lost for cattle-controllers
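For context on why that last line is fatal: the message comes from Kubernetes leader election. Below is a minimal sketch (assuming a recent client-go and in-cluster config; this is not Rancher's actual code) showing the mechanism. The lock name "cattle-controllers" is taken from the log above; the namespace, lock type and timings are illustrative assumptions. When the lease cannot be renewed in time, for example because every API call is timing out, OnStoppedLeading fires and the process exits.

```go
// Minimal sketch (not Rancher's actual code) of client-go leader election.
// The lock name "cattle-controllers" comes from the log above; the namespace,
// lock type and timings here are illustrative assumptions.
package main

import (
	"context"
	"log"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname()
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"kube-system", "cattle-controllers",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		log.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 45 * time.Second,
		RenewDeadline: 30 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// The controllers run only while this process holds the lock.
			},
			OnStoppedLeading: func() {
				// When lease renewal misses RenewDeadline (e.g. because the
				// API server is timing out on a slow etcd), this callback
				// fires and the process exits, which is the behaviour behind
				// the "[FATAL] leaderelection lost" line above.
				log.Fatal("leaderelection lost for cattle-controllers")
			},
		},
	})
}
```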
About this issue
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 21 (4 by maintainers)
I can verify the same behaviour with Rancher 2.1.1; it is really annoying. This issue should be reopened.
First, sorry for only (semi-)raging about this earlier. To expand on the details: I am seeing this behaviour on OVH Public Cloud, with the recommended Ubuntu 16.04.5 LTS, Docker 17.03.2-ce and kernel 4.15.0-39-generic.
The crash message seems to be different each time, but it is always of the type “Could not construct reference to …”.
In my case this happens (though I see similar log lines in others’ logs) after a series of high etcd update latencies and other related timeouts while communicating with some internal microservice. For example:
So the main problem seems to be very high etcd latency (26 s to update one record???). I saw similar messages from other users, but never that high. Is that normal? (A quick way to time an etcd write directly is sketched after this comment.) The other (possibly related) issue is that the service on port 6443 lags so badly that it causes timeouts in all the other components, so much that it crashes the main process? (This should never happen, c’mon Go…)
I hope these findings are useful to someone, and I am willing to help, but I am a Kubernetes newbie and I do not know much about etcd, expected behaviours and so on…
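One way to sanity-check the etcd write latency described in the comment above is to time a single Put directly against etcd. This is a minimal sketch under stated assumptions: the endpoint IP and key name are placeholders, and a real cluster also needs the etcd TLS certificates set in clientv3.Config.TLS.

```go
// Minimal latency probe against etcd, assuming direct access to the etcd
// client port (2379). Endpoint and key are placeholders; a real cluster also
// needs the etcd TLS certificates configured in clientv3.Config.TLS.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://172.31.4.120:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Time a single write; sustained latencies far above a few hundred
	// milliseconds (let alone the ~26 s reported above) point at disk or
	// network pressure on the etcd nodes.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	start := time.Now()
	_, err = cli.Put(ctx, "latency-probe", "x")
	cancel()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("etcd put took %v\n", time.Since(start))
}
```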
I am also experiencing the same issue from 2.0.6 to 2.0.8; it works for about 15 to 30 minutes and then restarts with this error.