rancher: Unable to provision Harvester RKE2 cluster from Rancher, stuck in provisioning on bootstrap node

Rancher Server Setup

  • Rancher version: v2.6-head (1d2b746)
  • Installation option (Docker install/Helm Chart): Docker

Information about the Cluster

  • Kubernetes version: v1.22.6+rke2r1
  • Cluster Type (Local/Downstream): Infrastructure Provider - Harvester

User Information

  • What is the role of the user logged in? Admin

Describe the bug

Unable to provision an RKE2 cluster from Rancher v2.6-head; it stays in Provisioning status with the message below. The issue exists on both RKE2 1.22.7+rke2r1 and 1.21.10+rke2r1.

Non-ready bootstrap machine(s) rke2-ubuntu-pool1-bcfd5cfbb-2pj9w and join url to be available on bootstrap node

Further checking the cluster node status inside the RKE2 VM, it looks like the node has a taint that prevents components from finishing their deployment.

# ssh to rke2 virtual machine

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes -o yaml
# relevant excerpt of the node spec from the output:
spec:
    podCIDR: 10.42.0.0/24
    podCIDRs:
    - 10.42.0.0/24
    taints:
    - effect: NoSchedule
      key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"

To Reproduce

  1. Create a one-node or three-node Harvester cluster
  2. Prepare the SLES JeOS image SLES15-SP3-JeOS.x86_64-15.3-OpenStack-Cloud-GM.qcow2
  3. Enable the virtual network on harvester-mgmt
  4. Create virtual network vlan1 with id 1
  5. Import Harvester into Rancher v2.6-head
  6. Create a cloud credential
  7. Provision an RKE2 cluster with the SLES JeOS image
  8. Check that the RKE2 cluster provisioning reaches Ready

Result

Provisioning is stuck at Non-ready bootstrap machine(s) rke2-ubuntu-pool1-bcfd5cfbb-2pj9w and join url to be available on bootstrap node

Expected Result

The RKE2 cluster can be provisioned in Harvester correctly, within an acceptable time, for the major OS images (e.g. SLES, Ubuntu).

Additional context

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 34 (5 by maintainers)

Most upvoted comments

Closing it as it was supposed to be closed a few days ago and got reopened due to Zube requiring “Done” label on issues. @TachunLin, I see a comment above by @lanfon72 that this may still be an issue or there is another related issue, please handle it as necessary.

According to what I learned (thanks @FrankYang0529), the harvester-cloud-provider is the component that removes the taint node.cloudprovider.kubernetes.io/uninitialized once it considers the node ready. That provider fails with the error:

node_controller.go:390] Initializing node rke2-in-jeos-pool1-51a1e43e-fm4kk with cloud provider
node_controller.go:212] error syncing 'rke2-in-jeos-pool1-51a1e43e-fm4kk': failed to get instance metadata for node rke2-in-jeos-pool1-51a1e43e-fm4kk: Get "https://192.168.0.131:6443/apis/kubevirt.io/v1/namespaces/default/virtualmachines/rke2-in-jeos-pool1-51a1e43e-fm4kk": dial tcp 192.168.0.131:6443: i/o timeout, requeuing

And thus it is unable to remove the taint.
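
As a diagnostic step only, not a fix, removing the taint by hand should let the remaining components schedule and confirms that the taint is what blocks them (a sketch, assuming direct kubectl access to the guest cluster):

kubectl taint nodes rke2-in-jeos-pool1-51a1e43e-fm4kk node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-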

I am confused about why it requires the CNI plugin to be running. The URL https://192.168.0.131:6443/apis/.... looks like a call to kube-api, and kube-api runs with hostNetwork: true, i.e. it does not require a CNI plugin. Could you verify whether 192.168.0.131 is the IP where kube-api is running?
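
One way to check that (a sketch, assuming kubectl access to whichever cluster is expected to be serving that address) is to compare it against the endpoint of the default kubernetes service, which is the address the kube-apiserver actually listens on:

# the kubernetes service endpoint shows the API server address and port
kubectl get endpoints kubernetes -n default -o wide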

Let me ask around. Before adding that toleration, I want to make sure that it will not introduce regression or other problems

@thedadams, it looks like the root cause is that Calico can't be installed. If we change the CNI to Canal, RKE2 1.22.7+rke2r1 can be installed.
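
For reference, outside of Rancher provisioning the equivalent standalone RKE2 setting is the cni option in the server config (a sketch; in Rancher-provisioned clusters the CNI is instead selected in the cluster configuration):

# /etc/rancher/rke2/config.yaml on the server node
cni: canal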

RKE2 1.22.7+rke2r1 upgrades tigera-operator to v1.23.5, and that version of tigera-operator changes the tolerations on calico-typha. Originally, tigera-operator v1.17.6 used tolerations for calico-typha like the following, so RKE2 1.21.10+rke2r1 can be installed without error.

      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists

However, tigera-operator v1.23.5 adds keys to the tolerations, as shown below, so calico-typha can no longer tolerate node.cloudprovider.kubernetes.io/uninitialized.

      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
      - effect: NoExecute
        key: node-role.kubernetes.io/etcd
        operator: Exists
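
For reference, the toleration under discussion would look roughly like the following (a sketch only; whether and where to add it, e.g. through the tigera-operator rendering of calico-typha, is exactly the open question above):

      tolerations:
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Exists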

@TachunLin The SLES issue seems to be unrelated to the issue with Ubuntu. Rancher is not responsible for installing apparmor-parser, so the SLES issue is understood (installation of apparmor-parser could be added to rancher-machine when the SLES operating system is detected).

For the Ubuntu image case, the nodes are tainted NoSchedule because the cloud provider (Harvester, in this case) isn't set up on the node. Has it been verified that the Harvester cloud provider is running successfully on these nodes? And has the Harvester cloud provider been tested on Kubernetes v1.22+ outside of Rancher provisioning?
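
A quick way to check both points (a sketch; the kube-system namespace and the harvester-cloud-provider deployment name are assumptions based on the default chart layout) would be:

# is the Harvester cloud provider pod running on the guest cluster?
kubectl -n kube-system get pods | grep harvester-cloud-provider
# inspect its logs for errors such as the i/o timeout shown above
kubectl -n kube-system logs deployment/harvester-cloud-provider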