rancher: Air-gapped provisioning of RKE2 cluster fails when vSphere Cloud Provider (CPI/CSI) enabled in cluster configuration

Issue description:

When provisioning an RKE2 Node Driver cluster with the vSphere Cloud Provider (i.e. CPI/CSI) enabled in the cluster configuration (as opposed to installed via Apps & Marketplace after cluster provisioning), the CPI and CSI charts do not use the configured system-default-registry and instead attempt to pull images from DockerHub. In an air-gapped environment without access to DockerHub, the CPI and CSI workloads therefore remain in CrashLoopBackoff while attempting to pull their images, and node provisioning does not complete.

Based on backend investigation, the charts themselves do not appear to be at fault: by editing the cluster configuration as YAML, the user can set global.cattle.systemDefaultRegistry, and the charts are then able to use a private registry. It looks like the UI (or wherever this is configurable) should apply the configured registry when provisioning.

There is a workaround (noted below).

Business impact:

Unable to provision an RKE2 cluster with the vSphere Cloud Provider enabled in the cluster configuration when using a private registry in an air-gapped environment.

Troubleshooting steps:

N/A

Repro steps:

  • Populate a private registry with the Rancher v2.6.0 images (https://rancher.com/docs/rancher/v2.6/en/installation/other-installation-methods/air-gap/populate-private-registry/)
  • Install Rancher v2.6.0 on a single node RKE cluster with --set systemDefaultRegistry per air-gap installation docs https://rancher.com/docs/rancher/v2.6/en/installation/other-installation-methods/air-gap/install-rancher/
  • Provision an RKE2 cluster using the vSphere Node Driver and the private registry config, without enabling the vSphere Cloud Provider in the cluster configuration, and observe successful cluster provisioning. After cluster provisioning, deploy the vSphere CSI and CPI charts from Apps & Marketplace and observe that the chart workloads deploy successfully using private registry images.
  • Provision an RKE2 cluster using the vSphere Node Driver and the private registry config, this time enabling the vSphere Cloud Provider in the cluster configuration and passing in its configuration under Addons, and observe that provisioning fails. Download the SSH key for the node and, using kubectl on the node, observe the CPI and CSI workloads failing while attempting to pull images from DockerHub rather than from the private registry.

Workaround:

Is workaround available and implemented? Yes

What is the workaround:

Provision the cluster without configuring the vSphere Cloud Provider, then install the vSphere CPI/CSI charts from Apps & Marketplace after cluster provisioning.

Actual behavior:

vSphere CSI/CPI workloads do not use configured system-default-registry if configured at cluster provisioning time, preventing successful cluster provisioning in air-gapped environment

Expected behavior:

vSphere CSI/CPI workloads use configured system-default-registry if configured at cluster provisioning time

Files, logs, traces:

N/A

Additional notes:

By clicking ‘Edit as YAML’ on the cluster configuration, the user can set global.cattle.systemDefaultRegistry manually for the rancher-vsphere-cpi and rancher-vsphere-csi charts, but the expectation is that these charts use the system-default-registry automatically.
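
For illustration, the manual override described above might look roughly like the fragment below. This is a sketch under assumptions, not a confirmed configuration: only the chart names and the global.cattle.systemDefaultRegistry key come from this report. The placement under spec.rkeConfig.chartValues is assumed from the usual layout of the provisioning cluster object, and registry.example.internal:5000 is a hypothetical registry host.

```yaml
# Sketch only: fragment of the cluster YAML shown by 'Edit as YAML'.
# Assumption: chart values live under spec.rkeConfig.chartValues.
# registry.example.internal:5000 is a hypothetical private registry host.
spec:
  rkeConfig:
    chartValues:
      rancher-vsphere-cpi:
        global:
          cattle:
            systemDefaultRegistry: registry.example.internal:5000
      rancher-vsphere-csi:
        global:
          cattle:
            systemDefaultRegistry: registry.example.internal:5000
```

Any vSphere-specific values already present for either chart would stay alongside the global block; the fragment only shows the registry override.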

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

Yeah, this needs to be fixed in the RKE2 vSphere provider Helm chart as described in https://github.com/rancher/rancher/issues/36467#issuecomment-1034809994