rancher: Runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (least amount of steps as possible):

  • Deploy Rancher v2.2.8
  • Create custom cluster with default settings
  • Add a node with missing firewall rules so that the node is only partially added
  • Remove the node from the Rancher UI and try to rejoin it

Result: Runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Other details that may be helpful: This looks very similar to https://github.com/rancher/rancher/issues/13484

Workaround: Copy /etc/cni/net.d/10-canal.conflist and calico-kubeconfig from a working node to the new node, then join the node to the cluster.
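
If you want to sanity-check the new node before re-joining it, something like the sketch below works; the calico-kubeconfig path is an assumption (for Canal it usually sits next to the conflist under /etc/cni/net.d/), so confirm it against the working node.

```go
// Sketch: verify the two files from the workaround are in place on the new
// node before re-joining it. The calico-kubeconfig location is an assumed
// default for Canal; adjust it to match the working node.
package main

import (
	"fmt"
	"os"
)

func main() {
	for _, p := range []string{
		"/etc/cni/net.d/10-canal.conflist",
		"/etc/cni/net.d/calico-kubeconfig", // assumed path, copied from a working node
	} {
		if _, err := os.Stat(p); err != nil {
			fmt.Printf("missing %s: %v\n", p, err)
			os.Exit(1)
		}
		fmt.Printf("found %s\n", p)
	}
}
```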

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown at the bottom left in the UI): v2.2.8
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): 2 CPU / 4 GB
  • Kubernetes version (use kubectl version): v1.14.6-rancher1
  • Docker version (use docker version): 17.03.2-ce

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

For me, everything was fine again after just a few hours of waiting.

@jiaqiluo @StrongMonkey Just saw this on @soumyalj’s setup as well, where the host’s dockerInfo DockerRootDir is temporarily empty, so the nodePlan passes wrong binds for the kubelet and the node never registers. We have never hit this before; @sangeethah, can we look into what specific OS or Docker version is causing this behavior?
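
The kind of guard being asked for would look roughly like this; buildKubeletBinds and the bind format are invented for illustration and are not the actual nodePlan code.

```go
// Illustrative guard only; this is not Rancher's nodePlan implementation.
package plan

import "fmt"

func buildKubeletBinds(dockerRootDir string) ([]string, error) {
	if dockerRootDir == "" {
		// docker info can briefly report an empty root dir; fail and retry
		// instead of registering the kubelet with wrong bind mounts.
		return nil, fmt.Errorf("empty DockerRootDir from docker info, retry node plan")
	}
	// Bind the docker root dir into the kubelet container.
	return []string{dockerRootDir + ":" + dockerRootDir}, nil
}
```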

For RKE clusters provisioned with Rancher, there is special handling for worker nodes: node update/delete is handled by the Rancher agent and is not included in the main RKE reconcile loop. For reference, the exclusion happens here: https://github.com/rancher/rancher/blob/master/pkg/controllers/management/clusterprovisioner/driver.go#L165.
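
Roughly, the exclusion boils down to something like this (types and function names are illustrative, not the linked driver.go code):

```go
// Illustrative only; the real filtering is in the linked driver.go.
package provisioner

type rkeNode struct {
	Address      string
	ControlPlane bool
	Etcd         bool
	Worker       bool
}

// nodesForReconcile drops worker-only nodes: their add/update/delete is
// driven by the Rancher agent, not by the main RKE reconcile loop.
func nodesForReconcile(all []rkeNode) []rkeNode {
	var keep []rkeNode
	for _, n := range all {
		if n.Worker && !n.ControlPlane && !n.Etcd {
			continue
		}
		keep = append(keep, n)
	}
	return keep
}
```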

On node creation, the Rancher agent gets a node plan from RKE/Rancher, places the configuration on the node, and starts all the Kubernetes components: https://github.com/rancher/rancher/blob/master/pkg/controllers/user/noderemove/nodes.go#L30
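
The agent-side flow is roughly the following; nodePlan, applyNodePlan and startProcess are invented names that only illustrate the sequence described above, not the actual agent code.

```go
// Rough shape of the flow only.
package agent

import "os"

type nodePlan struct {
	Files     map[string][]byte // e.g. /etc/cni/net.d/10-canal.conflist
	Processes []string          // e.g. kubelet, kube-proxy
}

func applyNodePlan(plan nodePlan) error {
	// 1. Place the configuration carried by the plan onto the node.
	for path, content := range plan.Files {
		if err := os.WriteFile(path, content, 0600); err != nil {
			return err
		}
	}
	// 2. Start the Kubernetes components listed in the plan.
	for _, name := range plan.Processes {
		if err := startProcess(name); err != nil {
			return err
		}
	}
	return nil
}

// startProcess stands in for the docker run calls the agent performs.
func startProcess(name string) error { return nil }
```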

On node removal, the agent simply removes the v1.Node, and then Rancher kills the agent container: https://github.com/rancher/rancher/blob/master/pkg/controllers/management/node/controller.go#L255.
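
Stripped of the controller plumbing, the removal amounts to deleting the v1.Node, shown below with client-go directly for illustration (a recent client-go Delete signature is assumed; the real code is in the linked controller):

```go
// Illustration of the "remove the v1.Node" step in isolation.
package agent

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func removeNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	return client.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{})
}
```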

We have to add a cleanup step to the node removal logic on the Rancher/agent side that removes all the necessary configs before proceeding with the agent removal.
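
Something along these lines; the paths are taken from the manual node-cleanup docs and trimmed, and the function is only a sketch of the proposed step, not the eventual implementation:

```go
// Hypothetical cleanup step to run before the agent container is removed.
package agent

import (
	"fmt"
	"os"
)

// Directories the node plan populates; list based on the manual cleanup docs.
var cleanupPaths = []string{
	"/etc/cni/net.d",
	"/etc/kubernetes",
	"/var/lib/cni",
	"/var/lib/calico",
}

func cleanupNode(reachable bool) error {
	if !reachable {
		// See the note below: an unreachable node has to be cleaned manually
		// before it can be re-registered.
		return fmt.Errorf("node unreachable, skipping cleanup")
	}
	for _, p := range cleanupPaths {
		if err := os.RemoveAll(p); err != nil {
			return err
		}
	}
	return nil
}
```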

Keep in mind that the cleanup can only be performed when the node can be contacted. If the node is unavailable at the moment of the agent removal call, manual cleanup will be needed before the node can be re-registered with Rancher, as per https://rancher.com/docs/rancher/v2.x/en/cluster-admin/cleaning-cluster-nodes/.