rancher: Rancher agent occasionally restarts worker processes without service options, causing pods to be restarted onto the Docker bridge network

Note: This is not a consistent issue; it does not happen on every upgrade, and any pod can end up with an incorrect IP. There are more details and a workaround described here: https://github.com/rancher/rancher/issues/23284#issuecomment-563464838

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible): Upgrade a Rancher cluster with multiple coredns containers from Rancher 2.2.8 to 2.3

Result: Some coredns containers fail to restart/upgrade automatically and the coredns-autoscaler restarts. Deleting the stale/old coredns containers brought DNS back online cluster-wide

Other details that may be helpful: Occurred on 4 different clusters during the same upgrade process. Some coredns containers were restarted and were fine, while others were not. All containers within the affected clusters could no longer ping outside resources (rancher.com, etc.)

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.3.0
  • Installation option (single install/HA): single install

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): custom

  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM

  • Kubernetes version (use kubectl version): v1.14.6

  • Docker version (use docker version): 18.9.6

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 130 (22 by maintainers)

Most upvoted comments

Available to test with v2.3.6-rc3

We have been analyzing the logic path that leads to this scenario, and believe we have identified the cause of the service options not being set.

There was a major change in 2.3.x which introduced a new tool called kontainer-driver-metadata (KDM) to allow dynamically populating Kubernetes system images, service options, and addon information. KDM uses K8s/etcd (as the rest of Rancher does) to store its data, and as such, there is a minor logic fault where KDM does not fully handle all error cases.

Notably, the snippet in question is:

func getRKEServiceOption(name string, svcOptionLister v3.RKEK8sServiceOptionLister, svcOptions v3.RKEK8sServiceOptionInterface) (*v3.KubernetesServicesOptions, error) {
	var k8sSvcOption *v3.KubernetesServicesOptions
	svcOption, err := svcOptionLister.Get(namespace.GlobalNamespace, name)
	if err != nil {
		if !errors.IsNotFound(err) {
			return k8sSvcOption, err
		}
		svcOption, err = svcOptions.GetNamespaced(namespace.GlobalNamespace, name, metav1.GetOptions{})
		if err != nil {
			if !errors.IsNotFound(err) {
				return k8sSvcOption, err
			}
		}
	}
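	// NOTE: if both the lister (cache) and the direct API call above return
	// NotFound, the error is swallowed and execution falls through, so the
	// caller receives empty service options instead of an error it could
	// retry on (see the explanation below).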
	if svcOption.Labels[sendRKELabel] == "false" {
		return k8sSvcOption, nil
	}
	logrus.Debugf("getRKEServiceOption: sending svcOption %s", name)
	return &svcOption.ServiceOptions, nil
}

The upstream logic that calls this function will set an “empty” service options struct (rather than nil) if Rancher is unable to pull the required CRD from either the lister (cache) or directly from the K8s API server. This type of behavior is to be expected in the two most common reported scenarios:

  1. Rancher is upgraded from 2.2.x to 2.3.x, and on startup, has not yet populated the KDM CRD’s (hence the node plans generated will contain empty service option data)
  2. Single-node Rancher is restarted (which restarts the embedded K3s api server + etcd) and potentially at startup Rancher is unable to pull the CRD data from the api server.

Both of these can be exacerbated by slow K(8|3)s API servers, which can stem from slow disks/slow etcd.
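
To make the failure mode concrete, here is a minimal, hypothetical sketch (not Rancher's actual code; the type and function names are invented stand-ins) of how an "empty, non-nil" service options value silently strips flags from the generated node plan, whereas a nil value could have been treated as "options not ready yet":

package main

import "fmt"

// ServicesOptions stands in for v3.KubernetesServicesOptions: extra flags
// per service, keyed by flag name.
type ServicesOptions struct {
	Kubelet map[string]string
}

// buildKubeletArgs stands in for the node-plan generation step.
func buildKubeletArgs(opts *ServicesOptions) []string {
	args := []string{"/opt/rke-tools/entrypoint.sh", "kubelet",
		"--kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-node.yaml"}
	if opts == nil {
		// A nil value could signal "options unavailable, keep the old plan".
		return args
	}
	for flag, value := range opts.Kubelet {
		args = append(args, fmt.Sprintf("--%s=%s", flag, value))
	}
	return args
}

func main() {
	// CRD present: the extra flags are included in the command.
	full := &ServicesOptions{Kubelet: map[string]string{"v": "2", "read-only-port": "0"}}
	fmt.Println(buildKubeletArgs(full))

	// CRD missing, but an empty struct is passed instead of nil: the plan is
	// generated without the extra flags, which is exactly the shortened
	// "Command has changed" kubelet command reported later in this thread.
	empty := &ServicesOptions{}
	fmt.Println(buildKubeletArgs(empty))
}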

This can be very easily reproduced by installing Kubernetes v1.16.3-rancher1-1 (set via Edit as YAML), then deleting the RKE service option that corresponds to it: kubectl delete -n cattle-global-data rkek8sserviceoptions v1.16.3-rancher1-1, then waiting for the cattle-node-agent to refresh the node plan (< 2 minutes).
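
If you want to check whether the service-option objects are actually present before or after that delete, the same lookup can be done programmatically. A minimal sketch using client-go's dynamic client; the group/version management.cattle.io/v3 is an assumption, while the resource name and namespace come from the kubectl command above:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed GVR for the rkek8sserviceoptions CRD.
	gvr := schema.GroupVersionResource{Group: "management.cattle.io", Version: "v3", Resource: "rkek8sserviceoptions"}

	list, err := dyn.Resource(gvr).Namespace("cattle-global-data").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Each item corresponds to a Kubernetes version, e.g. v1.16.3-rancher1-1.
	for _, item := range list.Items {
		fmt.Println(item.GetName())
	}
}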

/cc @kinarashah @deniseschannon

This has been open for 2 months and many people are affected. Why isn't this fixed yet? Stop introducing new buggy features and focus on fixing open issues! No wonder everybody is switching away. The issue must be rather obvious if it only happens with 2.3.x; it has to be some change that was introduced.

There are many solutions for this. As a general description of the problem, it looks like external DNS resolution stopped working. Changing some DNS options, adding a host alias, or pointing to a different DNS server works, but this needs to be done pod by pod.

The final solution is to just make the DNS work again for the entire cluster, but I’m not sure what is broken.

This is a critical bug, but I’m not sure if someone from the dev team is looking at this.

Do you have any update on this? Will version >= 2.3.3 fix the problem? We are struggling with the same issue: a bare-metal installation upgraded from 2.2.X -> 2.3.2 on Monday (04.11.2019). The failure has happened several times already. Every time, we reset canal and coredns and then all pods to bring it back to life.

Same problem here. First, cattle pods started failing because they could not find the cattle server. Then external-dns could not resolve the ACME servers. Something is definitely wrong with DNS resolution after upgrading to 2.3.x.

@Oats87 Thanks for your hard work on this Chris, I appreciate it.

We experienced the same issue: broken DNS resolution and in-cluster networking; pods were assigned 172.xx and 10.42.xx IPs.

Setup

  • Rancher
    • Standalone
    • Upgraded from 2.0
    • v2.2.8 at start of timeline
  • K8s Cluster
    • Cluster ID: c-6gvwv
    • HA
    • Using RKE+Node Driver
    • Combined etcd+controlplane
    • Network Provider: Canal, Project Network Isolation enabled
    • v1.14.x at start of timeline

Outage Timeline

  • 30.10.
    • Rancher Upgrade to v2.3.2
    • Waited till Rancher cluster-agents were upgraded
    • Cluster Upgrade to v1.15.5-rancher1-2
    • After the cluster upgrade finished, some pods were unable to access external services, and some services/pods could not be reached via Ingress anymore.
    • To remediate the situation we deleted every single pod (that is part of a Deployment/DaemonSet)
    • Networking was working again; every pod had a 10.42.xx address
  • 13.11.
    • Rancher container restarted multiple times (we assume a disk issue caused etcd to fail)
    • Multiple pods were recreated and were assigned 172.xx addresses; networking for these pods was unavailable.
    • We deleted all pods with 172.xx addresses, all were assigned 10.42.xx addresses, and networking resumed (sketched below)
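
For reference, the manual cleanup described in the last timeline entry can be scripted. This is a hypothetical sketch using client-go (not something posted in the thread): it lists pods whose IP falls inside the docker0 default bridge range and deletes them so the CNI can reassign a 10.42.x.x address. Pods owned by a Deployment/DaemonSet will be recreated automatically, which matches the "delete every single pod" step above.

package main

import (
	"context"
	"fmt"
	"net"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Default docker0 bridge range; adjust if your daemon uses a different CIDR.
	_, bridgeNet, _ := net.ParseCIDR("172.17.0.0/16")

	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		ip := net.ParseIP(p.Status.PodIP)
		// Skip host-network pods: they legitimately carry the node's IP.
		if ip == nil || p.Spec.HostNetwork || !bridgeNet.Contains(ip) {
			continue
		}
		fmt.Printf("deleting %s/%s (pod IP %s is on the bridge network)\n", p.Namespace, p.Name, p.Status.PodIP)
		if err := client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete failed:", err)
		}
	}
}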

Logs

Logs can be found in this gist.

Currently I have added:

We have the same problem, also upgraded from 2.2.X to the stable 2.3.2.

We have different clusters, but only some of them have problems:

  • ec2-cluster with only 1 machine, v1.14.8: no problems
  • ec2-cluster, v1.14.8: fails ~ once a day
  • ec2-cluster, v1.15.5: fails ~ once a day
  • eks-cluster: no problems
  • aks-cluster: no problems

So it seems to be unrelated to the k8s version. I would like to avoid creating a completely blank Rancher instance, so I am hoping for 2.3.3. If I can help by providing logs or anything else, I'd be happy to do so.

It is a very annoying problem, as we constantly need to check whether or not pods have been affected by the issue. Guessing the release will happen any moment now?


@superxor https://github.com/superxor We’re hoping to release later this week.

As for this issue, it is not reliably reproducible, but we are closing it as part of this release with the fixes we've identified. If it comes up again, we can investigate further.

@deniseschannon https://github.com/deniseschannon hi, when will 2.3.6 be released? I'm going crazy with this problem (k8s 1.17.4, docker 19.3.5).


Not sure this is really related, but this might be of interest to some: since updating Kubernetes to 1.17.3 we haven't had any issue with wrong pod IPs for about 5-6 days. Reading up on the Kubernetes changelog shows there are a couple of pod-IP related issues/regressions fixed in 1.17.3.

@Oats87 good to see some progress. 2 months ago I said "if it happens with 2.3.x, some change that was introduced", but it seems nobody took a closer look. There's only so much a non-dev can do here. Blaming it on an upstream issue was a cop-out. Well, thanks, I hope this is fixed soon.

@Oats87 yes, we have an air gap installation of rancher. Rancher is not able to reach github.com.

Regarding logs, I’m currently clarifying that internally.

What a bummer! Thank you very much @a-yiorgos! I totally missed that. I only checked the supported versions for Rancher and not explicitly for Kubernetes.

I'll downgrade Docker and check whether the issue still appears.

@michaellechnertwint the Docker version you are running Rancher (the UI) on is irrelevant. In our case the Rancher UI runs on docker-19. This has nothing, nothing to do with the behavior your Kubernetes cluster is experiencing. All your Kubernetes nodes should be on one of the supported versions, as documented in the Kubernetes release notes. For the 1.16.x series, the highest is 18.09. The latest 18.09 is 18.09.9. Install this on all your Kubernetes nodes.

What you are copying to @Oats87 is what we were seeing. This command change occurs when the CNI fails to assign an IP address to the pod, and thus the pod gets assigned an address from 172.17.0.x/16 after the kubelet restart.

I am writing this to make it clear for others who may not have much experience with Rancher and Kubernetes. This is not a problem with the UI. This is something that may happen (and has happened to us) with a kubeadm-launched cluster. So people need to be very careful (to avoid confusion) when we talk about Docker versions. The Docker version you are running the rancher/rancher container on (which can be docker-19) is one thing, and what your Kubernetes cluster nodes are on (which can be up to 18.09.9 for 1.16.x) is a totally different thing.

So please, echoing @superxor from above, use 18.09.9 for your Kubernetes clusters and check again whether this bites you or not. It helped in our case.
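
If you want to verify this on each node without eyeballing the docker version output, the daemon version can also be read via the Docker Engine API. A small, hypothetical sketch using the official Go client; the 18.09 check simply mirrors the advice above and is not an official support-matrix lookup:

package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	v, err := cli.ServerVersion(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println("docker daemon version:", v.Version)
	if !strings.HasPrefix(v.Version, "18.09") {
		fmt.Println("warning: node is not on the 18.09.x series recommended above for K8s 1.16.x")
	}
}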

@Oats87 backup script: backup.txt docker-compose: docker-compose.txt

Unfortunately I can't provide you the full kubelet and Rancher logs due to security reasons. But if you can tell me what I should keep an eye on, I can post some snippets. We are not using a load balancer in front of Rancher, only for the k8s nodes.

What I found in the cattle-node-agent logs for this affected node is interesting. It looks like the command changed somehow! It happened exactly at the time kubelet was restarted, and it appears only in the time range where kubelet was restarted:

time="2020-02-01T02:01:12Z" level=info msg="For process kube-proxy, Command has changed from [/opt/rke-tools/entrypoint.sh kube-proxy --v=2 --healthz-bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --hostname-override=node-03 --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-proxy.yaml] to [/opt/rke-tools/entrypoint.sh kube-proxy --cluster-cidr=10.42.0.0/16 --hostname-override=node-03 --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-proxy.yaml]"
time="2020-02-01T02:01:13Z" level=info msg="For process kubelet, Command has changed from [/opt/rke-tools/entrypoint.sh kubelet --read-only-port=0 --v=2 --cgroups-per-qos=True --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-node.yaml --volume-plugin-dir=/var/lib/kubelet/volumeplugins --address=0.0.0.0 --cni-conf-dir=/etc/cni/net.d --cluster-dns=10.0.0.10 --fail-swap-on=false --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --make-iptables-util-chains=true --streaming-connection-idle-timeout=30m --authentication-token-webhook=true --resolv-conf=/etc/resolv.conf --cloud-provider= --cluster-domain=cluster.local --root-dir=/var/lib/kubelet --anonymous-auth=false --event-qps=0 --cni-bin-dir=/opt/cni/bin --network-plugin=cni --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --hostname-override=node-03 --pod-infra-container-image=registry/rancher/pause:3.1 --authorization-mode=Webhook] to [/opt/rke-tools/entrypoint.sh kubelet --cluster-domain=cluster.local --root-dir=/var/lib/kubelet --pod-infra-container-image=registry/rancher/pause:3.1 --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --cloud-provider= --cluster-dns=10.0.0.10 --fail-swap-on=false --hostname-override=node-03 --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-node.yaml]"
time="2020-02-01T02:01:14Z" level=info msg="Starting plan monitor"
time="2020-02-01T02:03:14Z" level=info msg="For process kube-proxy, Command has changed from [/opt/rke-tools/entrypoint.sh kube-proxy --cluster-cidr=10.42.0.0/16 --hostname-override=node-03 --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-proxy.yaml] to [/opt/rke-tools/entrypoint.sh kube-proxy --cluster-cidr=10.42.0.0/16 --hostname-override=node-03 --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-proxy.yaml --healthz-bind-address=127.0.0.1 --v=2]"
time="2020-02-01T02:03:18Z" level=info msg="For process kubelet, Command has changed from [/opt/rke-tools/entrypoint.sh kubelet --cluster-domain=cluster.local --root-dir=/var/lib/kubelet --pod-infra-container-image=registry/rancher/pause:3.1 --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --cloud-provider= --cluster-dns=10.0.0.10 --fail-swap-on=false --hostname-override=node-03 --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-node.yaml] to [/opt/rke-tools/entrypoint.sh kubelet --streaming-connection-idle-timeout=30m --cluster-dns=10.0.0.10 --fail-swap-on=false --cgroups-per-qos=True --network-plugin=cni --volume-plugin-dir=/var/lib/kubelet/volumeplugins --authentication-token-webhook=true --event-qps=0 --cloud-provider= --read-only-port=0 --resolv-conf=/etc/resolv.conf --make-iptables-util-chains=true --v=2 --cni-conf-dir=/etc/cni/net.d --cluster-domain=cluster.local --hostname-override=node-03 --pod-infra-container-image=registry/rancher/pause:3.1 --address=0.0.0.0 --cni-bin-dir=/opt/cni/bin --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --authorization-mode=Webhook --client-ca-file=/etc/kubernetes/ssl/kube-ca.pem --kubeconfig=/etc/kubernetes/ssl/kubecfg-kube-node.yaml --root-dir=/var/lib/kubelet --anonymous-auth=false]"

@a-yiorgos we are using Docker 19.3.5 for the server. Only the docker client is 18.09.06, but I think that should not matter.

@michaellechnertwint we had been bitten by this issue for almost a year, and depending on the Kubernetes version, the issue manifested itself anywhere from once a week to daily. Finally, after many random tweaks, what seems to have settled the situation is the Docker version. We had even started reading k8s code to figure out when the pause container is launched and with what network setup.

What seemed to settle the dust was installing docker-18.09.9 on our workers (we are on bare metal, Ubuntu 18.04). Ubuntu comes with 18.09.7. We looked at the release notes of 18.09.9 and the description of the networking issues matched our experience.

Ever since then, we have been running flawlessly, with the exception of a single day where I killed all the calico pods and one node did not restart calico fast enough, so the containers got a 172.17.0.x/16 address instead of a calico-assigned one.

So could it be your docker version?

@michaellechnertwint are you able to provide me your backup script? Is your cluster a node driver cluster?

Do you have the ability to provide me your kubelet logs from the node where you ran into this problem? Also, the logs from the Rancher container itself would be very helpful. Do you have a load balancer in front of Rancher?

The logs can be grabbed by docker logs <container> &> <container>.log where <container> is the container in question.

Hi @Oats87 We're facing the same IP issue, where pods get wrong IPs from the docker0 bridge instead of the overlay network from flannel. We have a single-node install of Rancher (2.3.3) and one custom Kubernetes cluster on VMs. The Kubernetes cluster is on version 1.16.3 with 7 nodes (3 masters, 4 workers).

Last Saturday morning we faced the issue on one worker node again. Kubelet was unexpectedly restarted on this node. We do a Rancher backup at 3 AM, which stops Rancher for a short time. The kubelet restart on this worker node was right after that. Could that be related? In the kubelet log I cannot see anything that explains why kubelet was restarted. After that, all pods were restarted and a wrong IP address from the docker0 bridge was assigned to all pods on this node.

@superxor What is your most recent exact configuration of Rancher + Docker + K8s? We are working to try and reproduce this internally.

OK, we have the issue again, after a fresh Rancher and cluster setup. It seems there is no fix.

@rossigee since we downgraded Docker to a supported version we have had no issues (so far). 19.03 is not officially supported. We run 18.09 now. Try it.

@pcasis it's a coincidence; our failures are never during backup times.

@leodotcloud

"are getting attached to the docker bridge network during the upgrade window"

Is this an upgrade of the CNI plugin, or of k8s? If it's the former, will it happen if one of the CNI containers is down (assuming there is a default update strategy on the DaemonSet), or only when all of them are down?

Hi @alena1108 No, we did not set up any special configuration for coredns, nor for anything else in the system namespace. AFAICT it is a standard EC2 setup. coredns uses the image rancher/coredns-coredns:1.3.1

@alena1108 In our case we did not upgrade to the latest k8s version. It is the same case as my colleague @mariusburfey described:

before upgrade:

  • Cluster is EC2 with RKE and Kubernetes v1.14.8
  • Rancher server 2.2.8 as single node install

after upgrade:

  • Upgraded to rancher server 2.3.2
  • Clusters not touched

behaviour:

  • We had some issues directly after upgrading, which could be resolved by redeploying pods.
  • Now about once per day we experience the DNS issues which we can temporarily resolve by redeploying all coredns pods.
  • We cannot see any strange logs in either the coredns pods or the custom workloads. All errors revolve around DNS issues (cannot connect to database, etc.).
  • It does not happen for our managed k8s clusters (Azure and EKS)

Please let us know if we can provide any further info. Thanks for your efforts, they are very much appreciated!

Here are the details on my cluster

Setup originally with Rancher 2.3.0-alpha5 / Kubernetes 1.15 in late June; Rancher is a standalone setup. Upgraded to Rancher 2.3.0 when it shipped. Upgraded to Rancher 2.3.1 when it shipped. Upgraded to Rancher 2.3.2 when it shipped, and updated the cluster nodes' Kubernetes to 1.16 on the same day. I first noticed the issues about 2-3 days after the 2.3.2/1.16 upgrade, when I deployed a new version of a workload and found DNS wasn't working. I would restart the CoreDNS pod to resolve this. I repeated that 2-3 times over the next week, then redeployed all my 172.17.x.x workloads on November 4. They've been stable since then. I'm pretty sure it's fixed but I still watch it carefully.

As far as cluster customization goes, I have the vSphere options defined (in-tree provider) and vSphere persistent disks in use.

The only other customizations I have are:

In my node template: disk.enableUUID = true

and in my cluster config:

extraBinds

  • /usr/libexec/kubernetes/kubelet-plugins/volume/exec:/usr/libexec/kubernetes/kubelet-plugins/volume/exec

I put that second one in when I was having issues getting vSphere persistent disks working. In retrospect, I’m pretty sure this is not needed but once I got it working I left it in place.

@alena1108, I don't have the logs anymore, but I had a 2.2.8 cluster, created with Rancher (single node) on AWS EC2. I kept the Rancher data folder on an external volume, so I just upgraded the image to the latest tag. After a few days I started noticing problems deploying Let's Encrypt certificates, and the logs started showing DNS resolution failures for cert-manager.

I didn't make any modifications to Rancher or any k8s service.

Solved this issue by removing the line options ndots:5 from /etc/resolv.conf.
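
If editing /etc/resolv.conf inside the pod is not practical, the same ndots tweak can be expressed per pod through the Kubernetes dnsConfig field. A minimal sketch with the core/v1 types (illustrative only; the value of 1 is an example, not a recommendation from this thread):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	ndots := "1" // lower than the cluster-DNS default of 5
	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "app", Image: "busybox"}},
			DNSConfig: &corev1.PodDNSConfig{
				Options: []corev1.PodDNSConfigOption{{Name: "ndots", Value: &ndots}},
			},
		},
	}
	// The resulting resolv.conf in the pod will contain "options ndots:1".
	fmt.Println(pod.Spec.DNSConfig.Options[0].Name, *pod.Spec.DNSConfig.Options[0].Value)
}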

This is extremely relevant for my company. It has caused major downtime of production systems. We will have to migrate away from RKE to a managed Kubernetes in the next few days because of this.