terraform-provider-rancher2: Cluster Config loading error: Cannot read properties of undefined (reading 'length') when machineSelectorConfig is set to 'null'

Internal reference: SURE-6770 Reported in 2.7.5

Issue description: A user encountered an issue loading the Config tab for specific clusters from the Cluster Management page.

When going to the Cluster Management page, selecting a cluster and clicking the Config button, the page displays a “Loading …” text but ends up with the following message: Cannot read properties of undefined (reading 'length')

The UI dev console outputs the following errors:

- console.js:31 Cannot read properties of null (reading 'cloud-provider-name')
- console.js:31 Cannot read properties of undefined (reading 'length')

The UI loads fine for 1 out of 3 clusters, and the user can’t find any configuration for the working cluster that mentions “cloud-provider-name”.

When comparing the cluster YAMLs again, they found the following difference. For the two clusters that have an issue with the config UI, they see the following property in the YAML:

    machineSelectorConfig:
      - config: null

  • For the one cluster that works, this property is not present at all.

This config block can have “cloud-provider-name”, so we advised the customer to try removing it or setting it to an empty string or an actual string value, e.g.:

    machineSelectorConfig:
    - config:
        cloud-provider-name: azure
        protect-kernel-defaults: false
    registries: {} 

Removing this property or setting the value of config to {} instead of null fixed the UI issue in this case.
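
The workaround above can be sketched in a few lines of Python. This is only an illustration, not code from Rancher or the rancher2 provider: the function name `normalize_machine_selector_config` is hypothetical, and it assumes the relevant part of the cluster YAML has already been parsed into a plain dictionary with `machineSelectorConfig` and `registries` as sibling keys, as in the snippet above.

```python
import copy


def normalize_machine_selector_config(rke_config):
    """Return a copy in which every null selector `config` becomes {}.

    Hypothetical helper: `rke_config` is assumed to be the parsed YAML
    mapping that holds `machineSelectorConfig` (a list of mappings).
    """
    result = copy.deepcopy(rke_config)  # never mutate the caller's data
    entries = result.get("machineSelectorConfig")
    if entries is None:
        # Property absent (or itself null): drop it, as on the working cluster.
        result.pop("machineSelectorConfig", None)
        return result
    for entry in entries:
        if entry.get("config") is None:
            entry["config"] = {}  # empty map instead of null
    return result


broken = {"machineSelectorConfig": [{"config": None}], "registries": {}}
fixed = normalize_machine_selector_config(broken)
print(fixed)
```

The sketch replaces the null with an empty map rather than deleting the entry, since both variants were reported to fix the UI issue in this case.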

The user does not know what caused this property to be updated with the null value. All their clusters are created in the same way using Terraform, and they had been able to read and modify the config from the management console for all clusters, but recently it stopped working for 2 out of 3 clusters in one of their setups.

Business impact: The Cluster config page failed to load without any useful message, making it difficult to say what went wrong.

Workaround: Remove the "machineSelectorConfig: - config: null" code block completely, and the config starts loading.

Actual behavior: The Cluster config page failed to load or provide a useful message as to why it failed to load.

Expected behavior: The page should load okay, or the message should highlight that null is not an accepted config value.
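
As a rough illustration of the second option (a message that points at the null value), the check below walks the selector entries and reports each offending index. It is a sketch under assumptions, not the dashboard's actual code: the function name `find_null_selector_configs` and the message wording are hypothetical, and the input is assumed to be the parsed cluster YAML mapping that contains `machineSelectorConfig`.

```python
def find_null_selector_configs(rke_config):
    """Return readable problems for null or missing selector configs.

    Hypothetical helper: `rke_config` is assumed to be a parsed YAML
    mapping; an empty list means nothing suspicious was found.
    """
    problems = []
    # Treat an absent or null property as "no entries".
    entries = rke_config.get("machineSelectorConfig") or []
    for index, entry in enumerate(entries):
        if entry is None or entry.get("config") is None:
            problems.append(
                f"machineSelectorConfig[{index}].config is null or missing; "
                "use {} or remove the entry"
            )
    return problems


for problem in find_null_selector_configs(
    {"machineSelectorConfig": [{"config": None}]}
):
    print(problem)
```

An empty map passes the check, because the test is specifically for null rather than for emptiness.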

Additional notes:

  • Update from the user:

I was experimenting with changing RKE config parameters via Terraform and noticed that the rancher2 provider wants to change the machineSelectorConfig parameter when changing another parameter, such as etcd snapshot retention. I attempted this with the newest version of the rancher2 provider: https://registry.terraform.io/providers/rancher/rancher2/3.1.1

As a workaround, I tried adding this particular parameter to the ignore_changes lifecycle section like this:

resource "rancher2_cluster_v2" "cluster" {
  name               = var.cluster_name
  kubernetes_version = var.kubernetes_version

  # Ask Terraform to ignore drift on the machine selector config.
  lifecycle {
    ignore_changes = [
      rke_config[0].machine_selector_config,
    ]
  }
}

When I do this, I no longer see that Terraform wants to change this parameter in the Terraform plan output. However, after running an apply operation, the machineSelectorConfig is still changed to null, which then causes the UI issue again when attempting to view the config. It appears as if any kind of change to the RKE config on an RKE2 cluster resource via the rancher2 Terraform provider will result in this bug.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (10 by maintainers)

Most upvoted comments

Resolution of this issue will further be tracked in https://github.com/rancher/dashboard/issues/9888. Closing this out now, as this is not an issue with tfp-rancher2

After discussion about the test cases and just in general - we’ve determined that this is a UI issue really at this point. The cluster comes up fine after terraform apply and the UI just fails to display when there is no machine without label selectors.

The UI should be able to handle when there is no config - since the cluster does actually come up fine. Inserting the default field to mimic the UI behavior is ok which came from this - but it leads to a bunch of other behavior.

I opened a separate issue as well for the labelSelector problem: https://github.com/rancher/dashboard/issues/9888

@Josh-Diamond go ahead and either remove the last test case or modify it so that you don’t have to edit stuff in the UI since if a user is provisioning via terraform they should just modify the tf config rather than editing in the UI.

@Josh-Diamond reproduced this issue yesterday while testing an upgrade case for https://github.com/rancher/terraform-provider-rancher2/issues/1074 and I was able to reproduce it as well. It appears that the TF rancher2 provider is removing the protect-kernel-defaults field from v2 clusters, which Rancher sets to false by default on the backend for RKE2/k3s clusters. Setting it to true is required for a hardened RKE2 cluster. Removing that unexposed field sets config back to null, which causes the page crash.

I’ve discussed this with my team and put a fix into the provider for both Rancher 2.7 and 2.8 that will both resolve this issue and unblock the other one.

Please test this issue on Rancher v2.7 with RC v3.2.0-rc5. Thank you!

Thank you for your input @a-blender 🙇 🙏

@aalves08 After looking into this with Josh, correct I did not repro/was not able to reproduce this on an earlier version of TF and based on the QA test results neither could Josh. He wasn’t seeing Edit Config have a load issue on any version of TF past v3.1.1 - the yaml in the rancher UI shows up as

machine_selector_config {
  config: {}
}

or the entire block is auto removed by Terraform in the latest v3.2.0-rc2 if config=null.

Either case will not be a problem for future customers.

On the internal issue SURE-6770, the customer 1) likely provisioned their cluster with an even older version of tf (they mentioned they were experimenting with other fields on TF v3.1.1 but did not explicitly state the version where they saw the load page bug on). Or 2) provisioned their cluster with TF and unintentionally modified the cluster yaml to set config = null instead of an empty map.

Since this is not reproducible, the latest TF handles this edge case better, and we’ve already discussed that a UX fix is infeasible due to how RKE2 is designed, I’d say this issue can be closed.

@Josh-Diamond the fix comes only from the backend. commit here (if I am not mistaken): https://github.com/rancher/terraform-provider-rancher2/commit/7b9d01cf4e0d1842ff0dd97280285bad637f19d2

Since the “fix” appears to be introduced w/ tfp-rancher2 v3.2.0, I thought any version below that would reproduce the issue, so I used tfp-rancher2 v3.1.1. Once this comes back to-test, I’ll try to reproduce this using an older tfp-rancher2 version, and re-validate this then. Thanks so much for following up w/ me on this!

As for the repro steps, I never tested this with TF, but I would go the same direction as you (v3.1.1 vs v3.2.0). Maybe @a-blender can give a helping hand here so that we can move this back to test.

Thanks

I see the same problem with clusters which are generated with an (RKE2) Cluster Template (https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/manage-clusters/manage-cluster-templates)