rancher: “cluster agent is not ready” error seen when cluster does not have a node with the worker role

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible): Create Cluster

  • Add Cluster - Existing Nodes
  • Name it and hit Next, accepting the defaults
  • Node options/roles: select etcd and control plane (no worker)
  • Done

Add First Node
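
For anyone reproducing this, the registration command the UI generates for a custom cluster looks roughly like the sketch below; the server URL, token, and checksum are placeholders, and the agent tag should match the Rancher version under test.

```bash
# Sketch of the node registration command generated by the Rancher UI when only
# the etcd and control plane roles are selected (no --worker flag).
# <RANCHER_SERVER>, <TOKEN> and <CA_CHECKSUM> are placeholders.
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.5.1 \
  --server https://<RANCHER_SERVER> --token <TOKEN> --ca-checksum <CA_CHECKSUM> \
  --etcd --controlplane
```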

Result:

  • State: ERROR
  • Msg: Cluster health check failed: cluster agent is not ready

Other details that may be helpful: I have tried this a number of times using different Rancher Docker tags, and it continues to fail. The last version that worked for me was v2.4.8.

I thought this was related to issue 29652, but I tried master-head later and it still failed. I also tried supplying a certificate (as suggested in that issue), and it still failed.

I can easily reproduce this, so if someone wants me to try a different Docker tag or gather logs, I can do so.
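
If logs would help, this is roughly how I would gather them; the container IDs are placeholders to be looked up with docker ps, and this is a sketch rather than exact commands.

```bash
# Rancher server logs (single Docker install): find the rancher/rancher container first.
docker ps --filter ancestor=rancher/rancher:v2.5.1
docker logs --tail 500 <rancher-server-container> 2>&1 | grep -i "cluster agent"

# On the registered node, the rancher-agent registration container logs:
docker ps -a | grep rancher-agent
docker logs <rancher-agent-container>
```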

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): latest, v2.5.1, v2.5-rc3, master-head (as of 10/21)
  • Installation option (single install/HA): single Docker install of Rancher, attempting to create an HA cluster
  • OS: Ubuntu 18.04
  • Firewall: ufw inactive

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom/Existing Nodes
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM
  • Kubernetes version (use kubectl version): Whatever is bundled w/ the Rancher version (v1.19.3)
  • Docker version (use docker version): 19.03.6

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 37 (16 by maintainers)

Most upvoted comments

@superseb I’m not sure if you saw my message earlier, but after that I carried on trying different things. I’m happy to say that all issues are gone if I use Calico.

So, another summary of what I’ve tried so far:

  1. One node with all roles always works.
  2. Canal using the default network interface also works.
  3. Canal using the private network interface never works (see the YAML sketch after this list).
  4. Calico works (provided I add some crazy firewall rules to allow DNS resolution from the control plane to the worker).
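
To clarify what I mean by “private network interface” in (3): I point Canal/flannel at the private NIC via the cluster YAML. Here is a sketch of the fragment I edit; the option names and nesting are to the best of my understanding of the RKE docs and may differ between versions, and eth1 is just an example.

```bash
# Hypothetical cluster YAML fragment, pasted via "Edit as YAML" in the Rancher UI.
# canal_iface and the rancher_kubernetes_engine_config nesting are assumptions
# based on my reading of the RKE/Rancher docs.
cat <<'EOF'
rancher_kubernetes_engine_config:
  network:
    plugin: canal
    options:
      canal_iface: eth1   # bind flannel (Canal) to the private interface
EOF
```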

Having now spent many days on this issue, I’m even more convinced that control plane nodes should be able to become healthy without depending on other nodes. My expectation would be that the cluster is reported healthy once the first node is up, and then stays in a “waiting” state until all roles are present. Being able to use kubectl during this early stage would be so helpful!

Making DNS resolution depend on the existence of, and a stable connection to, at least one worker feels like the root of all the problems I’ve been experiencing. Not least because a control plane that waits for the first worker stays stuck on the “cluster agent is not ready” error, even while the worker is still registering. Debugging is nearly impossible (e.g. I still don’t know what the issue with Flannel is). My favourite CNI is Cilium, but given this experience I wouldn’t dare to start a cluster without any CNI deployed.
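
To make that concrete, this is the kind of check I would run with a working kubeconfig (e.g. from the all-roles setup) to see where DNS and the cluster agent actually land; the labels and namespaces are my understanding of the RKE/Rancher defaults.

```bash
# Where are CoreDNS and the cluster agent scheduled?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n cattle-system get pods -o wide

# Control-plane/etcd-only nodes carry taints that keep those pods off them
# unless the pods tolerate the taints.
kubectl describe nodes | grep -A3 -i taints
```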

Also, for hardened clusters like mine, external firewall configuration can be extremely difficult with overlay networks; delegating DNS resolution to one is asking for even more trouble.

The more I learn about Rancher, the more I like it! Thanks for the explanation of the nginx load balancer. I used to have to set that up myself outside the cluster, so that’s neat.

I looked at the nginx config and the IP of the control plane was indeed correct. This was my fault: I created this cluster without specifying that nodes should use the private network for internal traffic, and the firewall rules didn’t match. Apologies for not having seen this earlier.
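
For anyone checking the same thing, this is roughly how I looked at it; as I understand it, RKE runs a local nginx-proxy container on worker-only nodes that forwards 127.0.0.1:6443 to the control plane, so the container name and config path below are assumptions based on that.

```bash
# On a worker-only node: inspect the local nginx proxy in front of the apiserver.
docker ps --filter name=nginx-proxy
docker exec nginx-proxy cat /etc/nginx/nginx.conf   # upstream IPs should be the control plane's (private) address
docker logs nginx-proxy
```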

After fixing the firewall, the worker joined, which brings this cluster to the same bad state as the one I reported initially. There is one worker and one control plane/etcd node. Both have registered correctly, but the cluster does not become healthy because of the “cluster agent is not ready” error, as shown in my screenshot.

So to summarise:

  1. If I register a node with all roles, the cluster becomes healthy.
  2. If I register a control_plane node and then a worker, both join OK, but I get the “cluster agent is not ready” error. The cluster is not healthy, so I can’t inspect it with kubectl, etc.
  3. If I register a worker node and then the control plane, I get the same behaviour as in (2).
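
And this is the check I would run in cases (2) and (3) if I could get kubectl access; a sketch that assumes the usual Rancher objects in the cattle-system namespace.

```bash
# Is the cluster agent scheduled anywhere, and what keeps it off the control plane node?
kubectl -n cattle-system get deploy,pods -o wide
kubectl -n cattle-system describe deploy cattle-cluster-agent   # check nodeSelector/tolerations in the pod template
kubectl describe node <controlplane-node> | grep -A3 -i taints
```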