rancher: When a control plane node becomes unavailable (host powered down) while another control plane node exists in the cluster, pods cannot be deployed on the worker node or the other control plane node.

**Rancher versions:** v2.0.2

Steps to Reproduce: Create a cluster with the following node configuration: 1 control plane (n1), 1 etcd (n2), 1 worker (n3).

Add 1 more control plane node (n4).

Power down control node - n1.

Wait for the node to be marked “unavailable”.

Try to create a daemon set. 3 pods get created, of which only 1 starts successfully, on the new control plane node. An attempt to start a pod on the worker node fails with the following error:

 Normal   SuccessfulMountVolume   5m                  kubelet, ip-172-31-3-155  MountVolume.SetUp succeeded for volume "default-token-mqlnn"
  Normal   SandboxChanged          4m (x12 over 5m)    kubelet, ip-172-31-3-155  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  14s (x100 over 5m)  kubelet, ip-172-31-3-155  Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "hellotest-qjr5t_default" network: error getting ClusterInformation: Get https://10.43.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.43.0.1:443: getsockopt: no route to host
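
The address 10.43.0.1 here is the default ClusterIP of the kubernetes service in RKE-provisioned clusters (service CIDR 10.43.0.0/16), which the Calico CNI uses to reach the API server. A quick, hedged diagnostic sketch (these commands are not part of the original report; run the iptables/curl checks on the affected worker):

    # Which API server endpoints currently back the kubernetes service?
    kubectl get endpoints kubernetes -o wide

    # Did kube-proxy program NAT rules for the service ClusterIP on this node?
    sudo iptables -t nat -S KUBE-SERVICES | grep 10.43.0.1

    # Is the virtual IP reachable at all from this node?
    curl -k https://10.43.0.1:443/healthz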

The remaining pod is scheduled on the control plane node that is unavailable.

Deploying a workload with multiple replicas results in all of the pods being scheduled on the worker node, where they get stuck in the “ContainerCreating” state.

NAME                      READY     STATUS              RESTARTS   AGE
hello1-74f74757b9-5swtv   0/1       ContainerCreating   0          39m
hello1-74f74757b9-8rdp2   0/1       ContainerCreating   0          39m
hello1-74f74757b9-gz72c   0/1       ContainerCreating   0          39m
hello1-74f74757b9-xql8l   0/1       ContainerCreating   0          39m
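
To confirm these replicas are stuck for the same CNI/sandbox reason, the pod events can be checked directly (illustrative commands, not from the original report; the pod name is taken from the listing above):

    # Look for FailedCreatePodSandBox events on one of the stuck pods
    kubectl describe pod hello1-74f74757b9-5swtv

    # Or list all warning events in the namespace
    kubectl get events --field-selector type=Warning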


About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments: 20 (1 by maintainers)

Most upvoted comments

I have encountered somewhat similar problems, but with different symptoms.

Steps to reproduce: Create a cluster with the following topology:

  • 3 etcd
  • 2 controlplane
  • 2 worker nodes

Provider: OpenStack

Kill one control plane OpenStack cloud VM, e.g. master01.

Expected results

master01 appears unhealthy in Rancher.

All node services (kubelet, kube-proxy, nginx-proxy) reconnect gracefully to the second master, master02.

Actual results

In Rancher UI cluster nodes page, multiple hosts appear unhealthy:

  • master01 (obviously)
  • some etcd nodes (e.g. etcd01)
  • some worker nodes (e.g. worker01)

The services impacted below are those running on the nodes above.

CNI daemons (e.g. flannel) lose their connection to the API server and cannot configure/update the pod network overlay. They should use the iptables DNAT entry for the kubernetes:https service IP, but for some reason the traffic is not distributed across the healthy API servers by the iptables random-probability rules. Also, kube-proxy (see below) does not correctly update the list of Kubernetes API endpoints in iptables to remove the unhealthy entries.
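
For reference, in iptables mode kube-proxy implements the kubernetes:https service IP as a DNAT with one random-probability rule per endpoint, so with two healthy API servers each should receive roughly half of the new connections. A hedged sketch of what to inspect on an affected node (the KUBE-SVC-<hash>/KUBE-SEP-<hash> chain names are generated per cluster and are placeholders here):

    # Rules kube-proxy programmed for the kubernetes service ClusterIP
    sudo iptables -t nat -S KUBE-SERVICES | grep 'default/kubernetes'

    # Follow the KUBE-SVC-<hash> chain printed above. With two API endpoints it
    # should contain two KUBE-SEP-<hash> jumps, the first guarded by
    # "-m statistic --mode random --probability 0.5...". If the dead master's
    # endpoint is still listed, kube-proxy has not pruned it from iptables.
    sudo iptables -t nat -S KUBE-SVC-<hash>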

Pods that need to talk to the API server can no longer do so via the kubernetes service IP.

The kube-proxy daemon loses its connection to the API server (via nginx at 127.0.0.1:6443).

The kubelet loses its connection to the API server (via nginx at 127.0.0.1:6443).

The nginx-proxy daemon can no longer talk to the Kubernetes API. It appears stuck on the unhealthy master01, which is why none of the daemons above can reach the API anymore.
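
For context, on nodes without the controlplane role RKE runs an nginx-proxy container that forwards 127.0.0.1:6443 to the list of control plane addresses, and kubelet/kube-proxy are pointed at that local port. A hedged way to check what it currently proxies to and whether it is still pinned to the dead master (the container is named nginx-proxy in RKE; the exact config path may differ between versions):

    # Inspect the generated upstream list on a worker or etcd node
    docker exec nginx-proxy cat /etc/nginx/nginx.conf | grep -A 5 upstream

    # Does the local proxy port still answer?
    curl -k https://127.0.0.1:6443/healthz

    # Recent proxy logs may show repeated connect failures to master01
    docker logs --tail 50 nginx-proxy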

So what we experience is a global outage caused by the loss of a single master API server, in a cluster that should be HA.

I don’t understand why this issue was closed. From the comments above, many people are experiencing these issues.

I think this might only affect bare-metal installations.

@deniseschannon Is there an update on this? To me, this really makes Rancher 2 NOT production-ready.

I have 3 nodes, each running all roles (worker, etcd, and controlplane). If I take one node out, everything becomes unavailable: not just the Rancher UI, API, and kubectl, but also the workloads, because the Nginx ingress needs to refresh its config and can’t…