rancher: Unable to provision Harvester RKE2 cluster from Rancher, stuck in provisioning on bootstrap node

Rancher Server Setup

  • Rancher version: v2.6-head (1d2b746)
  • Installation option (Docker install/Helm Chart): Docker

Information about the Cluster

  • Kubernetes version: v1.22.6+rke2r1
  • Cluster Type (Local/Downstream): Infrastructure Provider - Harvester

User Information

  • What is the role of the user logged in? Admin

Describe the bug

Unable to provision an RKE2 cluster from Rancher v2.6-head; it stays in Provisioning status with the message below. The issue exists on both RKE2 1.22.7+rke2r1 and 1.21.10+rke2r1.

Non-ready bootstrap machine(s) rke2-ubuntu-pool1-bcfd5cfbb-2pj9w and join url to be available on bootstrap node

Further checking the cluster node status inside the RKE2 VM, it looks like the node has a taint that prevents components from finishing their deployment.

# ssh to rke2 virtual machine

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl get nodes -o yaml
# relevant excerpt of the node spec from the output:
spec:
    podCIDR: 10.42.0.0/24
    podCIDRs:
    - 10.42.0.0/24
    taints:
    - effect: NoSchedule
      key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"

To Reproduce

  1. Create a one-node or three-node Harvester cluster
  2. Prepare the SLES JeOS image SLES15-SP3-JeOS.x86_64-15.3-OpenStack-Cloud-GM.qcow2
  3. Enable the virtual network on harvester-mgmt
  4. Create virtual network vlan1 with id 1
  5. Import Harvester into Rancher v2.6-head
  6. Create a cloud credential
  7. Provision an RKE2 cluster with the SLES JeOS image
  8. Check that the RKE2 cluster provisioning reaches Ready

Result

Provisioning is stuck at Non-ready bootstrap machine(s) rke2-ubuntu-pool1-bcfd5cfbb-2pj9w and join url to be available on bootstrap node

Expected Result

The RKE2 cluster can be provisioned in Harvester correctly, within an acceptable time, for the major OS images (e.g. SLES, Ubuntu).

Additional context

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 34 (5 by maintainers)

Most upvoted comments

Closing it as it was supposed to be closed a few days ago and got reopened due to Zube requiring “Done” label on issues. @TachunLin, I see a comment above by @lanfon72 that this may still be an issue or there is another related issue, please handle it as necessary.

According to what I learned (thanks @FrankYang0529), the harvester-cloud-provider is the component that removes the taint node.cloudprovider.kubernetes.io/uninitialized once it considers the node ready. That provider fails with the error:

node_controller.go:390] Initializing node rke2-in-jeos-pool1-51a1e43e-fm4kk with cloud provider
node_controller.go:212] error syncing 'rke2-in-jeos-pool1-51a1e43e-fm4kk': failed to get instance metadata for node rke2-in-jeos-pool1-51a1e43e-fm4kk: Get "https://192.168.0.131:6443/apis/kubevirt.io/v1/namespaces/default/virtualmachines/rke2-in-jeos-pool1-51a1e43e-fm4kk": dial tcp 192.168.0.131:6443: i/o timeout, requeuing

And thus it is unable to remove the taint.
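
As a diagnostic step only, not a fix, removing the taint by hand should let the remaining components schedule and confirms that the taint is what blocks them (a sketch, assuming direct kubectl access to the guest cluster):

kubectl taint nodes rke2-in-jeos-pool1-51a1e43e-fm4kk node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-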

I am confused about why it requires the CNI plugin to be running. The URL https://192.168.0.131:6443/apis/.... looks like a call to kube-api, and kube-api runs with hostNetwork: true, i.e. it does not require a CNI plugin. Could you verify whether 192.168.0.131 is the IP where kube-api is running?
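
One way to check that (a sketch, assuming kubectl access to whichever cluster is expected to be serving that address) is to compare it against the endpoint of the default kubernetes service, which is the address the kube-apiserver actually listens on:

# the kubernetes service endpoint shows the API server address and port
kubectl get endpoints kubernetes -n default -o wide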

Let me ask around. Before adding that toleration, I want to make sure that it will not introduce regression or other problems

@thedadams, it looks like the root cause is that Calico can't be installed. If we change the CNI to Canal, RKE2 1.22.7+rke2r1 can be installed.
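
For reference, outside of Rancher provisioning the equivalent standalone RKE2 setting is the cni option in the server config (a sketch; in Rancher-provisioned clusters the CNI is instead selected in the cluster configuration):

# /etc/rancher/rke2/config.yaml on the server node
cni: canal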

RKE2 1.22.7+rke2r1 upgrades tigera-operator to v1.23.5, and that version of tigera-operator changes the tolerations on calico-typha. Originally, tigera-operator v1.17.6 used tolerations for calico-typha like the following, so RKE2 1.21.10+rke2r1 can be installed without error.

      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      - effect: NoExecute
        operator: Exists

However, tigera-operator v1.23.5 adds keys to the tolerations, as shown below, so calico-typha can no longer tolerate node.cloudprovider.kubernetes.io/uninitialized.

      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
      - effect: NoExecute
        key: node-role.kubernetes.io/etcd
        operator: Exists
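
For reference, the toleration under discussion would look roughly like the following (a sketch only; whether and where to add it, e.g. through the tigera-operator rendering of calico-typha, is exactly the open question above):

      tolerations:
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Exists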

@TachunLin The SLES issue seems to be unrelated to the issue with Ubuntu. Rancher is not responsible for installing apparmor-parser, so the SLES issue is understood (installation of apparmor-parser could be added to rancher-machine when the SLES operating system is detected).

For the Ubuntu image case, the nodes are tainted NoSchedule because the cloud provider (Harvester, in this case) isn't set up on the node. Has it been verified that the Harvester cloud provider is running successfully on these nodes? And has the Harvester cloud provider been tested on Kubernetes v1.22+ outside of Rancher provisioning?
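
A quick way to check both points (a sketch; the kube-system namespace and the harvester-cloud-provider deployment name are assumptions based on the default chart layout) would be:

# is the Harvester cloud provider pod running on the guest cluster?
kubectl -n kube-system get pods | grep harvester-cloud-provider
# inspect its logs for errors such as the i/o timeout shown above
kubectl -n kube-system logs deployment/harvester-cloud-provider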