rancher: Provisioning V2 / RKEv2 does not work with third party node drivers

Rancher Server Setup

  • Rancher version: 2.6.3
  • Installation option (Docker install/Helm Chart): docker

Information about the Cluster

  • Kubernetes version: v1.21.9+rke2r1
  • Cluster Type (Local/Downstream): Infrastructure using third party node driver

User Information

  • What is the role of the user logged in?: Admin

Describe the bug

Provisioning an rke2 cluster fails when it uses a third party node driver, i.e. one that isn’t a builtin.

Third Party node drivers added to rancher get a randomly assigned name as their kubernetes resource name:

harvester       186d
linode          312d
nd-rjl9k        44m

But rke2 derives the node driver name from the machine kind when trying to provision machines, which leads to the error below. This error also isn’t surfaced properly in the UI: there, machines are only described as Waiting to schedule machine create.

  status:
    conditions:
    - message: nodedrivers.management.cattle.io "nutanix" not found
      reason: Error
      status: "False"
      type: CreateJob

I assume this piece of code is responsible: https://github.com/rancher/rancher/blob/release/v2.6/pkg/controllers/provisioningv2/rke2/machineprovision/args.go#L298

func getNodeDriverName(typeMeta meta.Type) string {
	return strings.ToLower(strings.TrimSuffix(typeMeta.GetKind(), "Machine"))
}

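This simply takes the Kind of the machine CRD (in my case NutanixMachine), strips the Machine suffix and lowercases the rest, yielding nutanix instead of the generated name nd-rjl9k.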
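To make the mismatch concrete, here is a minimal standalone sketch. nodeDriverNameFromKind applies the same string logic as the quoted getNodeDriverName, but takes the Kind as a plain string so it needs no Rancher imports. lookupNodeDriver and the in-memory set of names are hypothetical stand-ins for fetching the nodedriver object from the cluster; they are not Rancher code.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeDriverNameFromKind applies the same string logic as the quoted
// getNodeDriverName, using a plain string instead of meta.Type.
func nodeDriverNameFromKind(kind string) string {
	return strings.ToLower(strings.TrimSuffix(kind, "Machine"))
}

// lookupNodeDriver is a hypothetical stand-in for fetching a
// nodedrivers.management.cattle.io object by its resource name.
func lookupNodeDriver(existing map[string]bool, name string) error {
	if !existing[name] {
		return fmt.Errorf("nodedrivers.management.cattle.io %q not found", name)
	}
	return nil
}

func main() {
	// Resource names as listed in this issue: the third party driver
	// was stored under a generated name, not under "nutanix".
	existing := map[string]bool{"harvester": true, "linode": true, "nd-rjl9k": true}

	// Builtin driver: the derived name matches the resource name.
	fmt.Println(lookupNodeDriver(existing, nodeDriverNameFromKind("HarvesterMachine")))

	// Third party driver: the derived name does not exist as a resource.
	fmt.Println(lookupNodeDriver(existing, nodeDriverNameFromKind("NutanixMachine")))
}
```

Builtin drivers work because their resource name happens to equal the lowercased Kind prefix; a driver stored as nd-rjl9k can never be found this way.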

To Reproduce

  1. Add a third party node driver
  2. Spawn a rke2 cluster

Result: The cluster is stuck in provisioning, and its machines only show Waiting to schedule machine create as their status.

Expected Result: The cluster provisions successfully.

Workaround: Create the nodedriver manually in the backing kubernetes cluster with the correct name.

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

This bug is caused by the fact that, with RKE1, node drivers were initially designed to be decoupled from their name, whereas v2prov (and specifically CAPI) requires CRDs (which was correctly pointed out here: https://github.com/rancher/rancher/issues/37074#issue-1183315207). The linked code https://github.com/rancher/rancher/blob/release/v2.6/pkg/controllers/provisioningv2/rke2/machineprovision/args.go#L298 is responsible, however the true underlying culprit is here: https://github.com/rancher/rancher/blob/e5cc549591fbdf6aec91915b83384cd78b56f769/pkg/controllers/management/drivers/nodedriver/machine_driver.go#L224C54-L224C54. This piece of code uses the displayName of the node driver object, which is not settable at creation time from the UI. Additionally, there is no validation in place to prevent multiple node drivers from using the same displayName, which will cause the dynamic schema to thrash and potentially cause data loss. Nor is there validation to prevent the displayName from being changed, which would also result in data loss. Although one can set the displayName manually, this is not a suitable long term solution.
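As a rough illustration of the missing validation, the sketch below groups node drivers by displayName and reports any displayName claimed by more than one driver. The nodeDriver struct and displayNameConflicts function are hypothetical, simplified stand-ins written for this comment; they are not part of Rancher, and the example driver values are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeDriver is a hypothetical, simplified view of a nodedriver object:
// its Kubernetes resource name and its spec.displayName.
type nodeDriver struct {
	Name        string
	DisplayName string
}

// displayNameConflicts returns every displayName used by more than one
// driver, mapped to the sorted resource names that share it.
func displayNameConflicts(drivers []nodeDriver) map[string][]string {
	byDisplay := map[string][]string{}
	for _, d := range drivers {
		byDisplay[d.DisplayName] = append(byDisplay[d.DisplayName], d.Name)
	}
	conflicts := map[string][]string{}
	for display, names := range byDisplay {
		if len(names) > 1 {
			sort.Strings(names)
			conflicts[display] = names
		}
	}
	return conflicts
}

func main() {
	// Example data only: two drivers share one displayName.
	drivers := []nodeDriver{
		{Name: "nd-rjl9k", DisplayName: "nutanix"},
		{Name: "nutanix", DisplayName: "nutanix"},
		{Name: "linode", DisplayName: "linode"},
	}
	for display, names := range displayNameConflicts(drivers) {
		fmt.Printf("displayName %q is shared by %v\n", display, names)
	}
}
```

Two active drivers that appear in the same group would both try to own the same dynamic schema object, which is the thrashing described above.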

A potential long term solution would be for the backend to use the k8s metadata name (which corresponds to the norman id). However, the UI uses norman, and I was not able to create a node driver while specifying the id in a POST request using curl. This requires input from @rancher/rancher-team-1-neo-dev as to whether it is possible to specify the id in a norman request. There is no way to remove the rancher requirement that the names be unique, because of the generated CRDs. Validating that all node drivers have a different display name would not be a suitable alternative to simply using the name of the corresponding nodedriver CR. cc @gaktive

Workaround

The script below outlines a workaround, assuming one has already encountered the issue when attempting to create a third party driver in the rancher UI with the correct url. The node driver should be inactive before running this script, as deactivation causes CRs to be cleaned up.

(export NAME="<DRIVER NAME (must be [a-z]*)>" NODEDRIVER="<DRIVER ID (e.g. nd-12345)>"; kubectl get nodedriver "${NODEDRIVER}" -o yaml | yq 'del(.status) | .metadata |= with_entries(select(.key == "annotations")) | .metadata.annotations |= with_entries(select(.key == "publicCredentialFields" or .key == "privateCredentialFields"))' | yq ".metadata.name = strenv(NAME)" | yq ".spec.displayName = .metadata.name")

After this, the original node driver (with prefix nd-xxxxx) can and should be deleted, as the two node drivers will thrash if both are activated, fighting for ownership of the dynamic schema object which in turn will create the CAPI infrastructure machine and machine template CRDs.

The script outputs only the minimum fields required to create a node driver. The resulting yaml can be piped to kubectl apply, which creates the new node driver under the chosen name.

Note: you cannot use this script to create a node driver with a name or displayName identical to that of another node driver. It won’t work, as they are backed by k8s CRs.

@rancher/docs , FYI moved this to “Release Note” status as we would want to include the workaround https://github.com/rancher/rancher/issues/37074#issuecomment-1664722305 in the next release notes, not specifically 2023-Q4/2024-Q1 releases.

@jakefhyde can you file a ticket in rancher/dashboard and link back here?

Can we re-open this ticket? It’s not possible to provision RKE2 clusters with the nutanix node driver.