rancher: Cluster Stuck in Updating State when Nodes are Missing During Deletion

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

  • Start removal of several nodes
  • Delete the nodes in the hosting provider's control panel (out of band; a hypothetical doctl example is sketched below)
  • The nodes are unreachable when Rancher tries to clean them up, and Rancher keeps retrying
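
For a DigitalOcean-provisioned cluster like this one, deleting the nodes at the provider amounts to removing the droplets out of band, for example with doctl (a sketch only; the droplet ID is a hypothetical placeholder):

# list droplets to find the ones backing the cluster nodes
$ doctl compute droplet list --format ID,Name,PublicIPv4
# delete a droplet directly at the provider, bypassing Rancher's cleanup
$ doctl compute droplet delete <droplet-id> --force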

Result: The nodes are unreachable because they have already been destroyed, so the cluster is stuck in the Updating state in Rancher. kubectl is also unusable.

Message displayed in UI: Failed to delete controlplane node [206.189.36.194] from cluster: Get https://104.248.150.110:6443/api/v1/nodes?timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers). The node 206.189.36.194 had already been removed outside of the cluster.
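
The failing request in that message can be reproduced directly against the API endpoint it names (a hedged diagnostic sketch; if the control plane is really gone, curl hangs and times out the same way Rancher does):

# probe the API server Rancher is trying to reach; -k skips certificate
# verification, --max-time keeps the probe from hanging indefinitely
$ curl -k --max-time 30 https://104.248.150.110:6443/healthz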

Other details that may be helpful: logs:

[tiller] 2019/05/01 13:55:09 getting history for release cluster-monitoring
[storage] 2019/05/01 13:55:09 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:09 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[tiller] 2019/05/01 13:55:09 preparing update for cluster-monitoring
[storage] 2019/05/01 13:55:09 getting deployed releases from "cluster-monitoring" history
[storage/driver] 2019/05/01 13:55:09 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[storage] 2019/05/01 13:55:09 getting last revision of "cluster-monitoring"
[storage] 2019/05/01 13:55:09 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:09 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
2019/05/01 13:55:09 [ERROR] AppController p-kcxw8/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: UPGRADE FAILED: the server is currently unable to handle the request (get configmaps)

E0501 13:55:10.161169       6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Secret: Get https://206.189.36.194:6443/api/v1/secrets?limit=500&resourceVersion=0&timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0501 13:55:10.259878       6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ConfigMap: Get https://206.189.36.194:6443/api/v1/configmaps?limit=500&resourceVersion=0&timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0501 13:55:10.265748       6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Namespace: Get https://206.189.36.194:6443/api/v1/namespaces?limit=500&resourceVersion=0&timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[main] 2019/05/01 13:55:10 Starting Tiller v2.10+unreleased (tls=false)
[main] 2019/05/01 13:55:10 GRPC listening on :57772
[main] 2019/05/01 13:55:10 Probes listening on :37324
[main] 2019/05/01 13:55:10 Storage driver is ConfigMap
[main] 2019/05/01 13:55:10 Max history per release is 0
[tiller] 2019/05/01 13:55:10 getting history for release cluster-monitoring
[storage] 2019/05/01 13:55:10 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:10 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[tiller] 2019/05/01 13:55:10 preparing update for cluster-monitoring
[storage] 2019/05/01 13:55:10 getting deployed releases from "cluster-monitoring" history
[storage/driver] 2019/05/01 13:55:10 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[storage] 2019/05/01 13:55:10 getting last revision of "cluster-monitoring"
[storage] 2019/05/01 13:55:10 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:10 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
2019/05/01 13:55:10 [ERROR] AppController p-kcxw8/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: UPGRADE FAILED: the server is currently unable to handle the request (get configmaps)

E0501 13:55:11.203701       6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.LimitRange: Get https://206.189.36.194:6443/api/v1/limitranges?limit=500&resourceVersion=0&timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[main] 2019/05/01 13:55:11 Starting Tiller v2.10+unreleased (tls=false)
[main] 2019/05/01 13:55:11 GRPC listening on :33868
[main] 2019/05/01 13:55:11 Probes listening on :49504
[main] 2019/05/01 13:55:11 Storage driver is ConfigMap
[main] 2019/05/01 13:55:11 Max history per release is 0
[tiller] 2019/05/01 13:55:12 getting history for release cluster-monitoring
[storage] 2019/05/01 13:55:12 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:12 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[tiller] 2019/05/01 13:55:12 preparing update for cluster-monitoring
[storage] 2019/05/01 13:55:12 getting deployed releases from "cluster-monitoring" history
[storage/driver] 2019/05/01 13:55:12 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[storage] 2019/05/01 13:55:12 getting last revision of "cluster-monitoring"
[storage] 2019/05/01 13:55:12 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:12 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
2019/05/01 13:55:12 [ERROR] AppController p-kcxw8/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: UPGRADE FAILED: the server is currently unable to handle the request (get configmaps)

[main] 2019/05/01 13:55:13 Starting Tiller v2.10+unreleased (tls=false)
[main] 2019/05/01 13:55:13 GRPC listening on :39943
[main] 2019/05/01 13:55:13 Probes listening on :52559
[main] 2019/05/01 13:55:13 Storage driver is ConfigMap
[main] 2019/05/01 13:55:13 Max history per release is 0
[tiller] 2019/05/01 13:55:13 getting history for release cluster-monitoring
[storage] 2019/05/01 13:55:13 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:13 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[tiller] 2019/05/01 13:55:13 preparing update for cluster-monitoring
[storage] 2019/05/01 13:55:13 getting deployed releases from "cluster-monitoring" history
[storage/driver] 2019/05/01 13:55:13 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[storage] 2019/05/01 13:55:13 getting last revision of "cluster-monitoring"
[storage] 2019/05/01 13:55:13 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:13 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
2019/05/01 13:55:13 [ERROR] AppController p-kcxw8/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: UPGRADE FAILED: the server is currently unable to handle the request (get configmaps)

E0501 13:55:14.151898       6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1beta2.Deployment: Get https://206.189.36.194:6443/apis/apps/v1beta2/deployments?limit=500&resourceVersion=0&timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0501 13:55:14.329077       6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ClusterRoleBinding: Get https://206.189.36.194:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0&timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[main] 2019/05/01 13:55:14 Starting Tiller v2.10+unreleased (tls=false)
[main] 2019/05/01 13:55:14 GRPC listening on :56057
[main] 2019/05/01 13:55:14 Probes listening on :38475
[main] 2019/05/01 13:55:14 Storage driver is ConfigMap
[main] 2019/05/01 13:55:14 Max history per release is 0
[tiller] 2019/05/01 13:55:14 getting history for release cluster-monitoring
[storage] 2019/05/01 13:55:14 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:14 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[tiller] 2019/05/01 13:55:15 preparing update for cluster-monitoring
[storage] 2019/05/01 13:55:15 getting deployed releases from "cluster-monitoring" history
[storage/driver] 2019/05/01 13:55:15 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[storage] 2019/05/01 13:55:15 getting last revision of "cluster-monitoring"
[storage] 2019/05/01 13:55:15 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:15 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
2019/05/01 13:55:15 [ERROR] AppController p-kcxw8/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: UPGRADE FAILED: the server is currently unable to handle the request (get configmaps)

E0501 13:55:15.153174       6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.ConfigMap: Get https://206.189.36.194:6443/api/v1/namespaces/cattle-system/configmaps?limit=500&resourceVersion=0&timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[main] 2019/05/01 13:55:16 Starting Tiller v2.10+unreleased (tls=false)
[main] 2019/05/01 13:55:16 GRPC listening on :55566
[main] 2019/05/01 13:55:16 Probes listening on :45365
[main] 2019/05/01 13:55:16 Storage driver is ConfigMap
[main] 2019/05/01 13:55:16 Max history per release is 0
[tiller] 2019/05/01 13:55:16 getting history for release cluster-monitoring
[storage] 2019/05/01 13:55:16 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:16 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[tiller] 2019/05/01 13:55:16 preparing update for cluster-monitoring
[storage] 2019/05/01 13:55:16 getting deployed releases from "cluster-monitoring" history
[storage/driver] 2019/05/01 13:55:16 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
[storage] 2019/05/01 13:55:16 getting last revision of "cluster-monitoring"
[storage] 2019/05/01 13:55:16 getting release history for "cluster-monitoring"
[storage/driver] 2019/05/01 13:55:16 query: failed to query with labels: the server is currently unable to handle the request (get configmaps)
2019/05/01 13:55:16 [ERROR] AppController p-kcxw8/cluster-monitoring [helm-controller] failed with : failed to install app cluster-monitoring. Error: UPGRADE FAILED: the server is currently unable to handle the request (get configmaps)

E0501 13:55:17.286210       6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Job: Get https://206.189.36.194:6443/apis/batch/v1/jobs?limit=500&resourceVersion=0&timeout=30s: context deadline exceeded
E0501 13:55:17.287148       6 reflector.go:134] github.com/rancher/norman/controller/generic_controller.go:175: Failed to list *v1.Endpoints: Get https://206.189.36.194:6443/api/v1/namespaces/cattle-prometheus/endpoints?limit=500&resourceVersion=0&timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Environment information

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): rancher/rancher:latest as of today
  • Installation option (single install/HA): single install

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Digital Ocean created cluster from Rancher
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): 2 cloud nodes, each 2 vCPU / 2 GB RAM
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Error from server (ServiceUnavailable): the server is currently unable to handle the request
  • Docker version (use docker version):
Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:43:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Reactions: 5
  • Comments: 33 (5 by maintainers)

Most upvoted comments

I don’t recall anything in the docs saying what we should or shouldn’t do with etcd. Pardon me if I’m wrong, but I thought the point of K8s (just my understanding) is that nodes are kind of a “temp resource”, in that they can be created and destroyed at any time.

If we need to be careful and only allow a certain number of nodes to have etcd, and it’s quite dangerous to add and remove nodes with the control plane role, should we have a note in the docs that says “be careful when adding nodes with the control plane role, because it can leave your cluster unable to function”?

I’m thinking most people will probably do what I did: start with one node, then add more nodes over time, and possibly remove some. So it would be good to have a way to remove nodes that aren’t accessible anymore, or to automatically drop them from the control plane if they’re not accessible after X number of tries (like 30).

Is there a need to guarantee that the control plane components on an inaccessible node get deleted? Will it harm the cluster as a whole if they cannot be deleted? (Pardon me if this sounds rude, but I’m just plain curious; it doesn’t make sense to me at a high level, and I’m letting Rancher control my cluster for now because I haven’t had a moment to dig into K8s in depth.)

Another workaround that worked, in case my previous workaround doesn’t work anymore:

Situation: I tried removing the node through the Rancher GUI, and then it got stuck in Removing with "waiting for ... to come online; waiting for node controller"

Kubernetes itself showed the correct state and number of nodes, without the stuck node:

$ kubectl get nodes

Solution: I fully removed the VM (in my case on vSphere), then…

# switch the kube-config to the local/management cluster
$ export KUBECONFIG="kube-config-rancher.yaml"

# find the ID of your stuck node within your cluster
$ CLUSTER_ID="c-8qc95"
$ kubectl get nodes.v3.management.cattle.io -A | grep $CLUSTER_ID
NAMESPACE   NAME            AGE
...
c-8qc95     m-4dlvs         15h
c-8qc95     m-slbjw         15h

# edit the specific node
$ kubectl edit nodes.v3.management.cattle.io/m-slbjw -n $CLUSTER_ID

Now your console editor (e.g. vi) opens. There, find the following lines:

  deletionTimestamp: "2021-04-21T10:15:22Z"
  finalizers:
  - controller.cattle.io/node-controller
  generateName: m-
  generation: 19

and change it to

  deletionTimestamp: "2021-04-21T10:15:22Z"
  finalizers: []
  generateName: m-
  generation: 19

Save and close.

=> your node will disappear from the GUI and your cluster will be back in a healthy state
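
The same finalizer removal can also be done non-interactively with kubectl patch instead of kubectl edit (a sketch, assuming the same hypothetical cluster and node IDs as above):

# clear the finalizers on the stuck node object in the Rancher management cluster
$ kubectl patch nodes.v3.management.cattle.io/m-slbjw -n c-8qc95 \
    --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'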

my workaround…

While my cluster was stuck with the same symptom, showing “Removing” without any progress, I finally managed to get it back into a healthy state.

  • I set the node count to the desired number (which is what started the mess in the first place)
  • fully removed the underlying VM with the appropriate provider tools (e.g. “openstack server rm …”)
  • cluster still showing “Removing”
  • switch the kube-config to the local/management cluster $ export KUBECONFIG="kube-config-rancher.yaml"
  • edit the unhealthy cluster: $ kubectl edit cluster c-12345p
  • remove the unhealthy node from the nodes section (see the consolidated sketch after this list):
  nodes:
  - address: 1.2.3.4
    hostnameOverride: unhealthy-node-name
    ...
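
Putting those steps together, the edit looks roughly like this (a sketch, assuming the hypothetical cluster ID c-12345p from above; for RKE-provisioned clusters the node list typically lives under spec.rancherKubernetesEngineConfig.nodes):

# point kubectl at the Rancher local/management cluster
$ export KUBECONFIG="kube-config-rancher.yaml"
# confirm the ID of the stuck downstream cluster
$ kubectl get clusters.management.cattle.io
# open the cluster object and delete the unhealthy entry from the nodes list
$ kubectl edit clusters.management.cattle.io c-12345p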

Looking through the GitHub issues on here, I may be wrong, but it seems this problem happens a lot. I’m a bit nervous to run this in production for other apps right now. I appreciate the help, don’t get me wrong, but this kind of problem is really causing delays for my work. The only reason I’m not just wiping my servers and starting over is that I wanted to grab some config settings, but they are not reachable because the cluster is stuck in its current state. Is there a way I can grab that info if there’s no fix?

Confirmed that we had this issue this morning - an AWS node died and took the whole cluster with it. The restore snapshot trick seemed to work, but it feels like if a single node can take out the whole cluster, what’s the point ¯\_(ツ)_/¯

It is easy to reproduce this issue - create a 3-node cluster with all 3 nodes as controlplane, etcd, and worker. Then, in the AWS console, stop and start one of the nodes (causing it to change IP address).

The cluster will start spinning and will not recover without intervention (a hypothetical AWS CLI sketch of the stop/start step follows).
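
The stop/start step can be scripted with the AWS CLI (a sketch; the instance ID is a hypothetical placeholder, and a stop/start cycle normally assigns a new public IP unless the instance has an Elastic IP):

# stop the instance and wait until it is fully stopped
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0
$ aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
# start it again; the public IP will usually change
$ aws ec2 start-instances --instance-ids i-0123456789abcdef0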

Discovered that today, ending in deletion of the cluster. Rancher 2.6.8, k3s, downstream RKE.

Same issue here… nothing useful except EOF error messages in the Rancher server log.

@HangingClowns I too didn’t read anything in the docs like that, but obviously there is an issue. The restore method fixed it for me both times. I get that it’s a hard one to track down for the dev team, but there is a bug. I also think their focus is more on getting Windows support right now, so I don’t foresee this getting fixed soon. So, long story short, I won’t be scaling them down anytime soon.

@timzaak We used the restore snapshot option in the menu on the far right. That removed the stuck node for us. Also, after reading more, it’s best not to scale your etcd. Ideally this issue gets fixed so all of them can scale safely, but until then, we have 5 nodes that have all roles and another pool with nodes that are just workers.

I tried adding a new node; now the cluster is stuck with this message:

"This cluster is currently Updating.

Failed to apply the ServiceAccount needed for job execution: Post https://4.4.4.4:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?timeout=30s: can not build dialer to c-46lkp:m-236de8f2ce5d"

The new node is stuck in “Registering” state. Perfect.