harvester: [BUG] RKE2 provisioning fails when Rancher has no internet access (air-gapped)

Rancher: 2.6.3
Harvester: 1.0.0

Expected behavior

The driver binaries of “built-in” node drivers are included in the Rancher release and do not have to be downloaded post-install.

This is the case for other “built-in” node drivers such as AWS, Azure, etc.; the url field of these node drivers is set to "url": "local://".
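For illustration, a built-in driver object looks roughly like the following (a sketch from memory of the management.cattle.io/v3 NodeDriver resource; the field names are assumptions, not an exact manifest):

apiVersion: management.cattle.io/v3
kind: NodeDriver
metadata:
  name: amazonec2
spec:
  active: true
  builtin: true          # binary ships with the Rancher release
  displayName: Amazon EC2
  url: "local://"        # no post-install download required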

Actual behavior

In Rancher 2.6.3 the Harvester node driver is marked as a “built-in” driver, yet its url field is set to an external URL: https://releases.rancher.com/harvester-node-driver/v0.3.4/docker-machine-driver-harvester-amd64.tar.gz.

When provisioning a Harvester RKE2 cluster from an air-gapped Rancher instance, provisioning fails because the driver cannot be downloaded from the internet.

[INFO ] provisioning bootstrap node(s) qxn2533-test-pool1-654f8d44c5-txjqd: waiting to schedule machine create
[INFO ] provisioning bootstrap node(s) qxn2533-test-pool1-654f8d44c5-txjqd: creating server (HarvesterMachine) in infrastructure provider
[INFO ] failing bootstrap machine(s) qxn2533-test-pool1-654f8d44c5-92t59: failed creating server (HarvesterMachine) in infrastructure provider: CreateError: Failure detected from referenced resource rke-machine.cattle.io/v1, Kind=HarvesterMachine with name "qxn2533-test-pool1-33b61c5f-5nxbm": Downloading driver from https://releases.rancher.com/harvester-node-driver/v0.3.4/docker-machine-driver-harvester-amd64.tar.gz
ls: cannot access 'docker-machine-driver-*': No such file or directory
downloaded file failed sha256 checksum
download of driver from https://releases.rancher.com/harvester-node-driver/v0.3.4/docker-machine-driver-harvester-amd64.tar.gz failed and join url to be available on bootstrap node

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (3 by maintainers)

Most upvoted comments

I can finally set up RKE2 in a purely air-gapped environment, but it requires some odd steps. This is based on David’s setup.

  1. (k3s with Rancher VM) Update CoreDNS by editing /var/lib/rancher/k3s/server/manifests/coredns.yaml (e.g. with sudo vim).
# update ConfigMap
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        hosts /etc/coredns/customdomains.db {
          fallthrough
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
  customdomains.db: |
    192.168.0.50 airgap helm-install.local

# update deployment
# remove NodeHost key and path
# add customdomains.db
            - key: customdomains.db
              path: customdomains.db
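For step 1’s Deployment change, the resulting volume entry would look roughly like this (volume and ConfigMap names are assumed from the default k3s CoreDNS manifest):

volumes:
  - name: config-volume
    configMap:
      name: coredns
      items:
        - key: Corefile
          path: Corefile
        - key: customdomains.db      # added item; the NodeHost item mentioned above is removed
          path: customdomains.db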
  2. Import SLES15-SP3-JeOS.x86_64-15.3-OpenStack-Cloud-GM.qcow2 to Harvester.
  3. Create an RKE2 cluster with the following userData.
runcmd:
- - systemctl
  - enable
  - --now
  - qemu-guest-agent
bootcmd:
  - echo 192.168.0.50 helm-install.local myregistry.local >> /etc/hosts
  4. (RKE2 VM) Create a file at /etc/rancher/agent/tmp_registries.yaml:
mirrors:
  docker.io:
    endpoint:
      - "https://myregistry.local:5000"
configs:
  "myregistry.local:5000":
    tls:
      insecure_skip_verify: true
  5. (RKE2 VM) Update the rancher-system-agent config file /etc/rancher/agent/config.yaml:
agentRegistriesFile: /etc/rancher/agent/tmp_registries.yaml
  6. (RKE2 VM) Restart rancher-system-agent:
systemctl restart rancher-system-agent.service
  7. (RKE2 VM) Create a file at /etc/rancher/rke2/registries.yaml:
mirrors:
  docker.io:
    endpoint:
      - "https://myregistry.local:5000"
configs:
  "myregistry.local:5000":
    tls:
      insecure_skip_verify: true
  8. (RKE2 VM) Update ConfigMap kube-system/rke2-coredns-rke2-coredns in RKE2:
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus 0.0.0.0:9153
        hosts /etc/coredns/customdomains.db helm-install.local {
            fallthrough
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
  customdomains.db: |
    192.168.0.50 helm-install.local
  9. (RKE2 VM) Update Deployment kube-system/rke2-coredns-rke2-coredns:
# add following to volumes[].configMap
- key: customdomains.db
  path: customdomains.db
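Analogous to the k3s patch in step 1, the patched volume in the rke2-coredns Deployment would look roughly like this (volume and ConfigMap names are assumptions based on the rke2-coredns chart defaults):

volumes:
  - name: config-volume
    configMap:
      name: rke2-coredns-rke2-coredns
      items:
        - key: Corefile
          path: Corefile
        - key: customdomains.db      # added item
          path: customdomains.db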

Steps 4 to 6 are weird. Theoretically, we could create /etc/rancher/agent/registries.yaml via cloud-config, but I am not sure which process overwrites my content. You can update /etc/rancher/agent/registries.yaml and restart rancher-system-agent to observe the overwriting behavior.

If we could write /etc/rancher/agent/registries.yaml and /etc/rancher/rke2/registries.yaml from cloud-config, then only steps 8 and 9 would need to be done manually when provisioning RKE2.
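A minimal sketch of that cloud-config, using the standard cloud-init write_files module (untested; as noted above, it is unclear whether rancher-system-agent would overwrite /etc/rancher/agent/registries.yaml afterwards):

write_files:
  - path: /etc/rancher/agent/registries.yaml
    permissions: "0600"
    content: |
      mirrors:
        docker.io:
          endpoint:
            - "https://myregistry.local:5000"
      configs:
        "myregistry.local:5000":
          tls:
            insecure_skip_verify: true
  - path: /etc/rancher/rke2/registries.yaml
    permissions: "0600"
    content: |
      mirrors:
        docker.io:
          endpoint:
            - "https://myregistry.local:5000"
      configs:
        "myregistry.local:5000":
          tls:
            insecure_skip_verify: true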

Per @thedadams, the CI build that should contain the fixes for this is rancher/rancher:v2.6-a49b9d913555d59834a30f6e2ae676f0bd54bba6-head.

@alexdepalex wrote:

@janeczku As a workaround, we changed the url of the node driver in the nodedrivers/harvester CRD to point to a locally hosted node driver. It seems, though, that this change is reverted from time to time. Is there a way to disable this?

I believe this will revert every time a server/control-plane node is started in the cluster. That said, this is the best workaround at this time. /cc @oats87
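For reference, the workaround could be applied against the Rancher local cluster roughly as follows; the mirror URL is hypothetical, the spec field names are assumptions based on the NodeDriver resource, and, as noted above, Rancher may revert the change whenever a server/control-plane node starts. Since the provisioning log shows a sha256 check, the checksum field likely has to match the locally hosted tarball as well.

# hypothetical merge patch, applied with e.g.:
#   kubectl patch nodedrivers.management.cattle.io harvester --type merge --patch-file harvester-driver-patch.yaml
spec:
  url: "https://myregistry.local/harvester-node-driver/v0.3.4/docker-machine-driver-harvester-amd64.tar.gz"
  checksum: "<sha256 of the locally hosted tarball>"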