terraform-provider-rancher2: [BUG] Occasionally RKE2 cluster gets destroyed after cluster configuration is changed using terraform provider
Important: Please see https://github.com/rancher/terraform-provider-rancher2/issues/993#issuecomment-1611922983 on the status of this issue following completed investigations.
Rancher Server Setup
- Rancher version: 2.6.8
- Installation option (Docker install/Helm Chart):
- installed as helm chart
- running on k3s 1.24.4
- Proxy/Cert Details: N/A
Information about the Cluster
- Kubernetes version: 1.23.9
- Cluster Type: Downstream, Custom
- RKE2 v1.23.9+rke2r1 running on AWS, provisioned with the Terraform provider rancher/rancher2 version 1.24.1
- Custom cluster configuration:
{
"kubernetesVersion": "v1.23.9+rke2r1",
"rkeConfig": {
"upgradeStrategy": {
"controlPlaneConcurrency": "1",
"controlPlaneDrainOptions": {
"enabled": false,
"force": false,
"ignoreDaemonSets": true,
"IgnoreErrors": false,
"deleteEmptyDirData": true,
"disableEviction": false,
"gracePeriod": 0,
"timeout": 10800,
"skipWaitForDeleteTimeoutSeconds": 600,
"preDrainHooks": null,
"postDrainHooks": null
},
"workerConcurrency": "10%",
"workerDrainOptions": {
"enabled": false,
"force": false,
"ignoreDaemonSets": true,
"IgnoreErrors": false,
"deleteEmptyDirData": true,
"disableEviction": false,
"gracePeriod": 0,
"timeout": 10800,
"skipWaitForDeleteTimeoutSeconds": 600,
"preDrainHooks": null,
"postDrainHooks": null
}
},
"chartValues": null,
"machineGlobalConfig": {
"cloud-provider-name": "aws",
"cluster-cidr": "100.64.0.0/13",
"cluster-dns": "100.64.0.10",
"cluster-domain": "cluster.local",
"cni": "none",
"disable": [
"rke2-ingress-nginx",
"rke2-metrics-server",
"rke2-canal"
],
"disable-cloud-controller": false,
"kube-apiserver-arg": [
"allow-privileged=true",
"anonymous-auth=false",
"feature-gates=CustomCPUCFSQuotaPeriod=true",
"api-audiences=https://<REDACTED>-oidc.s3.eu-central-1.amazonaws.com,https://kubernetes.default.svc.cluster.local,rke2",
"audit-log-maxage=90",
"audit-log-maxbackup=10",
"audit-log-maxsize=500",
"audit-log-path=/var/log/k8s-audit/audit.log",
"audit-policy-file=/etc/kubernetes-/audit-policy.yaml",
"authorization-mode=Node,RBAC",
"bind-address=0.0.0.0",
"enable-admission-plugins=PodSecurityPolicy,NodeRestriction",
"event-ttl=1h",
"kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP",
"profiling=false",
"request-timeout=60s",
"runtime-config=api/all=true",
"service-account-key-file=/etc/kubernetes-wise/service-account.pub",
"service-account-lookup=true",
"service-account-issuer=https://<REDACTED>-oidc.s3.eu-central-1.amazonaws.com",
"service-account-signing-key-file=/etc/kubernetes-wise/service-account.key",
"service-node-port-range=30000-32767",
"shutdown-delay-duration=60s",
"tls-min-version=VersionTLS12",
"tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
"tls-min-version=VersionTLS12",
"v=2"
],
"kube-apiserver-extra-mount": [
"/etc/kubernetes-wise:/etc/kubernetes-wise:ro",
"/var/log/k8s-audit:/var/log/k8s-audit:rw"
],
"kube-controller-manager-arg": [
"allocate-node-cidrs=true",
"attach-detach-reconcile-sync-period=1m0s",
"bind-address=0.0.0.0",
"configure-cloud-routes=false",
"feature-gates=CustomCPUCFSQuotaPeriod=true",
"leader-elect=true",
"node-monitor-grace-period=2m",
"pod-eviction-timeout=220s",
"profiling=false",
"service-account-private-key-file=/etc/kubernetes-wise/service-account.key",
"use-service-account-credentials=true",
"terminated-pod-gc-threshold=12500",
"tls-min-version=VersionTLS12",
"tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
"tls-min-version=VersionTLS12"
],
"kube-controller-manager-extra-mount": [
"/etc/kubernetes-wise:/etc/kubernetes-wise:ro"
],
"kube-proxy-arg": [
"conntrack-max-per-core=131072",
"conntrack-tcp-timeout-close-wait=0s",
"metrics-bind-address=0.0.0.0",
"proxy-mode=iptables"
],
"kube-scheduler-arg": [
"bind-address=0.0.0.0",
"port=0",
"secure-port=10259",
"profiling=false",
"leader-elect=true",
"tls-min-version=VersionTLS12",
"v=2",
"tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
"tls-min-version=VersionTLS12"
],
"kubelet-arg": [
"network-plugin=cni",
"cni-bin-dir=/opt/cni/bin/",
"cni-conf-dir=/etc/cni/net.d/",
"feature-gates=CustomCPUCFSQuotaPeriod=true",
"config=/etc/kubernetes-wise/kubelet.yaml",
"exit-on-lock-contention=true",
"lock-file=/var/run/lock/kubelet.lock",
"pod-infra-container-image=docker-k8s-gcr-io.<REDACTED>/pause:3.1",
"register-node=true",
"tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
"tls-min-version=VersionTLS12",
"v=4"
],
"profile": "cis-1.6",
"protect-kernel-defaults": true,
"service-cidr": "100.64.0.0/13"
},
"additionalManifest": "<REDACTED>",
"registries": {
"mirrors": {
"docker.io": {
"endpoint": [
"<REDACTED>"
]
},
"gcr.io": {
"endpoint": [
"<REDACTED>"
]
},
"k8s.gcr.io": {
"endpoint": [
"<REDACTED>"
]
},
"quay.io": {
"endpoint": [
"<REDACTED>"
]
}
},
"configs": {
"<REDACTED>": {}
}
},
"etcd": {
"snapshotScheduleCron": "0 */6 * * *",
"snapshotRetention": 12,
"s3": {
"endpoint": "s3.eu-central-1.amazonaws.com",
"bucket": "etcd-backups-<REDACTED>",
"region": "eu-central-1",
"folder": "etcd"
}
}
},
"localClusterAuthEndpoint": {},
"defaultClusterRoleForProjectMembers": "user",
"enableNetworkPolicy": false
}
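For reference, roughly how this configuration maps onto the rancher2_cluster_v2 resource in the Terraform provider. This is a hedged, abbreviated sketch, not my full Terraform code: only a few fields from the JSON dump above are shown, with values copied from that dump.

```hcl
# Abbreviated sketch (not the full configuration): how a few fields from the
# JSON cluster config above are expressed with the rancher2 provider.
resource "rancher2_cluster_v2" "this" {
  name               = "o11y-euc1-se-main01"
  kubernetes_version = "v1.23.9+rke2r1"

  rke_config {
    # machineGlobalConfig is passed as a YAML string
    machine_global_config = <<-EOT
      cloud-provider-name: aws
      cni: none
      profile: cis-1.6
      protect-kernel-defaults: true
      disable:
        - rke2-ingress-nginx
        - rke2-metrics-server
        - rke2-canal
    EOT

    etcd {
      snapshot_schedule_cron = "0 */6 * * *"
      snapshot_retention     = 12
      s3_config {
        endpoint = "s3.eu-central-1.amazonaws.com"
        bucket   = "etcd-backups-<REDACTED>"
        region   = "eu-central-1"
        folder   = "etcd"
      }
    }
  }
}
```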
Additional info:
- I am using a custom CNI (aws-vpc-cni) installed via additional_manifest
Describe the bug
Occasionally, a simple cluster configuration change (for example, changing labels in manifests passed via additional_manifest) applied with the Terraform provider causes the managed RKE2 cluster to be destroyed.
terraform plan looks similar to this:
Terraform Plan output
Terraform will perform the following actions:
# module.cluster.rancher2_cluster_v2.this will be updated in-place
~ resource "rancher2_cluster_v2" "this" {
id = "fleet-default/o11y-euc1-se-main01"
name = "o11y-euc1-se-main01"
# (10 unchanged attributes hidden)
~ rke_config {
~ additional_manifest = <<-EOT
---
apiVersion: v1
kind: Namespace
metadata:
labels:
- test: test
+ test1: test1
name: my-namespace
EOT
}
}
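The attribute being changed here is just rke_config.additional_manifest. A minimal sketch of the relevant part of the resource (values taken from the plan above, everything else omitted) looks like this:

```hcl
# Minimal sketch of the only attribute that changes between applies; all other
# rke_config settings stay as in the abbreviated sketch earlier in this issue.
resource "rancher2_cluster_v2" "this" {
  name               = "o11y-euc1-se-main01"
  kubernetes_version = "v1.23.9+rke2r1"

  rke_config {
    # Editing a single label in this embedded manifest yields the small in-place
    # diff shown above, yet occasionally ends with Rancher deleting the cluster.
    additional_manifest = <<-EOT
      ---
      apiVersion: v1
      kind: Namespace
      metadata:
        labels:
          test1: test1
        name: my-namespace
    EOT
  }
}
```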
Sometimes, once a change like this is applied, Rancher immediately tries to delete the managed cluster for some reason. In the UI it looks like this:
Rancher UI screenshot: (image omitted)
Rancher logs:
2022/09/08 08:58:30 [DEBUG] [planner] rkecluster fleet-default/<REDACTED>: unlocking 810235e7-ecc0-4ba7-81c8-55d778594926
2022/09/08 08:58:30 [INFO] [planner] rkecluster fleet-default/<REDACTED>: waiting: configuring bootstrap node(s) custom-7808e68fb38f: waiting for plan to be applied
2022/09/08 08:58:30 [DEBUG] [CAPI] Cannot retrieve CRD with metadata only client, falling back to slower listing
2022/09/08 08:58:30 [DEBUG] DesiredSet - Patch rbac.authorization.k8s.io/v1, Kind=Role fleet-default/crt-<REDACTED>-nodes-manage for auth-prov-v2-roletemplate-<REDACTED> nodes-manage -- [PATCH:{"metadata":{"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRTY/bIBD9K9UcK5Oa4NjYUk899FCph9WqlyqHAYYNXQwW4LTSKv+9IrtVrK36cYMH8+Z9PMFMBQ0WhOkJMIRYsLgYcr1G9Y10yVR2ycWdxlI87Vx85wxMgGs5sSXFMzvvWYqeCs2Lx0JMG0ar5mypx5ZD80ei+D1QYg/nR5hgxoAPNFMomw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQS94vuDGMS+l0O3Jz0oGgY96rFFtKq1qA2146Gv214U61Rep8deudjqtJ6oMEMWV1+qw+rkjiwlCpoyTF+fABf3hVJ2McAEtS5Xzy48bFOuHT26UGv94NdcKMFN029trtf+ueJCKdsx3fKedQcpmLIG2SDwICxHYcjC5XhpIK3+JuZjiutSb6CfN+1+sEeZdy7CsYFEOa5J0+fq8vppzSXOjFNr0XDkUiA0v1CSqDk/jGPf9zdUKKV6MnLUvb2hfTsMrUBOYlQ3tBOi10PX7Tu7YRhkK6mXVgm5YRiF4aPppDooudV61TmjPrlAuT6cKakr+BaOl+PlZwAAAP//1WFOc2MDAAA"}},"rules":[{"apiGroups":["cluster.x-k8s.io"],"resourceNames":["custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6","custom-7808e68fb38f","custom-93d19d48b5b8"],"resources":["machines"],"verbs":["*"]}]}, ORIGINAL:{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4yRT4/bLBDGv8qrOb4KqQk2xpZ66qGHSj2sql6qHAYYNnRtsACnlaJ894rsVo626p8bPDDP/GaeC8xU0GJBGC+AIcSCxceQ6zXqr2RKprJPPu4NljLR3sc33sIIuJYTW1I8s/OBpThRoXmZsBAzltFqOFvqseGw+61R/BYoscfzE4wwY8BHmimUuw9nsfvvgw/27UOc6NNLg78aBpwJRgjRUmbPvv9Ukxc0tRCuO5hQ0/THLZwwn2AEqfmhE704SNW3BzK9pr43gxkaRKcbh8ZSM3Symr6AmVReL4m9gr3HcRNRYZYcrlOpg1TgB3KUKBjKMH65AC7+M6XsY4ARaiq+nn14vF9mjeLJh5reu2nNhRJsTL+Ett5i5poLrV3LTMMlazslmHYWWS+wE46jsOTgerzuIK3TBvM+xXWpNzDPnfbf2ZPKex/huINEOa7J0Mc65e3TmkucWa8aRVI5LZSD3U91EJYPtlW602pTOTUOLUeuBG4qKTScd8MgpdxUobWWZNVg5J2vbPq+EchJDHpTWyGk6dv20Dp5z3rjnNGcfKBcH86U9E38H47X4/VHAAAA///VWUNFSgMAAA","objectset.rio.cattle.io/id":"auth-prov-v2-roletemplate-<REDACTED>","objectset.rio.cattle.io/owner-gvk":"management.cattle.io/v3, Kind=RoleTemplate","objectset.rio.cattle.io/owner-name":"nodes-manage","objectset.rio.cattle.io/owner-namespace":""},"labels":{"objectset.rio.cattle.io/hash":"6b125373268742ec7be77c9c90aafb0facde0956"},"name":"crt-<REDACTED>-nodes-manage","namespace":"fleet-default","ownerReferences":[{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","name":"<REDACTED>","uid":"1b13bbf4-c016-4583-bfda-73a53f1a3def"}]},"rules":[{"apiGroups":["cluster.x-k8s.io"],"resourceNames":["custom-7808e68fb38f","custom-93d19d48b5b8","custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6"],"resources":["machines"],"verbs":["*"]}]}, 
MODIFIED:{"kind":"Role","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"crt-<REDACTED>-nodes-manage","namespace":"fleet-default","creationTimestamp":null,"labels":{"objectset.rio.cattle.io/hash":"6b125373268742ec7be77c9c90aafb0facde0956"},"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRTY/bIBD9K9UcK5Oa4NjYUk899FCph9WqlyqHAYYNXQwW4LTSKv+9IrtVrK36cYMH8+Z9PMFMBQ0WhOkJMIRYsLgYcr1G9Y10yVR2ycWdxlI87Vx85wxMgGs5sSXFMzvvWYqeCs2Lx0JMG0ar5mypx5ZD80ei+D1QYg/nR5hgxoAPNFMomw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQS94vuDGMS+l0O3Jz0oGgY96rFFtKq1qA2146Gv214U61Rep8deudjqtJ6oMEMWV1+qw+rkjiwlCpoyTF+fABf3hVJ2McAEtS5Xzy48bFOuHT26UGv94NdcKMFN029trtf+ueJCKdsx3fKedQcpmLIG2SDwICxHYcjC5XhpIK3+JuZjiutSb6CfN+1+sEeZdy7CsYFEOa5J0+fq8vppzSXOjFNr0XDkUiA0v1CSqDk/jGPf9zdUKKV6MnLUvb2hfTsMrUBOYlQ3tBOi10PX7Tu7YRhkK6mXVgm5YRiF4aPppDooudV61TmjPrlAuT6cKakr+BaOl+PlZwAAAP//1WFOc2MDAAA","objectset.rio.cattle.io/id":"auth-prov-v2-roletemplate-<REDACTED>","objectset.rio.cattle.io/owner-gvk":"management.cattle.io/v3, Kind=RoleTemplate","objectset.rio.cattle.io/owner-name":"nodes-manage","objectset.rio.cattle.io/owner-namespace":""},"ownerReferences":[{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","name":"<REDACTED>","uid":"1b13bbf4-c016-4583-bfda-73a53f1a3def"}]},"rules":[{"verbs":["*"],"apiGroups":["cluster.x-k8s.io"],"resources":["machines"],"resourceNames":["custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6","custom-7808e68fb38f","custom-93d19d48b5b8"]}]}, CURRENT:{"kind":"Role","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"crt-<REDACTED>-nodes-manage","namespace":"fleet-default","uid":"2d3678e7-1904-442f-bfa6-ef4ad97baa40","resourceVersion":"32202831","creationTimestamp":"2022-09-08T07:56:23Z","labels":{"objectset.rio.cattle.io/hash":"6b125373268742ec7be77c9c90aafb0facde0956"},"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRS48UIRD+K6aOphmboYd+JJ48eDDxsNl4MXMooNjBpaED9Giymf9umF3TkzU+bvBBffU9nmCmggYLwvQEGEIsWFwMuV6j+ka6ZCq75OJOYymedi6+cwYmwLWc2JLimZ33LEVPhebFYyGmDaNVc7bUY8uh+SNR/B4osYfzI0wwY8AHmimUmw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQRS8f1B9GIvh77bk+4V9b0e9dgiWtVa1Iba8SDrthfFOpXX6bFXLm51Wk9UmCGLqy/VYXVyR5YSBU0Zpq9PgIv7Qim7GGCCWperZxceblOuHT26UGv94NdcKMGm6bc212v/XHGhlO2Ybrlk3WEQTFmDrBd4EJajMGThcrw0kFa/ifmY4rrUG+jnTbsf7HHIOxfh2ECiHNek6XN1ef205hJn1g/tQHKwSgwWml/oKAwfTTeogxo2lFNr0XDkg8ANpQE154dxlFJuqFBKSTLDqOUNr2z7vhXISYxqQzshpO67bt9Zeav1qnNGfXKBcn04U1JX8C0cL8fLzwAAAP//skxNxGMDAAA","objectset.rio.cattle.io/id":"auth-prov-v2-roletemplate-<REDACTED>","objectset.rio.cattle.io/owner-gvk":"management.cattle.io/v3, 
Kind=RoleTemplate","objectset.rio.cattle.io/owner-name":"nodes-manage","objectset.rio.cattle.io/owner-namespace":""},"ownerReferences":[{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","name":"<REDACTED>","uid":"1b13bbf4-c016-4583-bfda-73a53f1a3def"}],"managedFields":[{"manager":"rancher","operation":"Update","apiVersion":"rbac.authorization.k8s.io/v1","time":"2022-09-08T07:58:00Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:objectset.rio.cattle.io/applied":{},"f:objectset.rio.cattle.io/id":{},"f:objectset.rio.cattle.io/owner-gvk":{},"f:objectset.rio.cattle.io/owner-name":{},"f:objectset.rio.cattle.io/owner-namespace":{}},"f:labels":{".":{},"f:objectset.rio.cattle.io/hash":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"1b13bbf4-c016-4583-bfda-73a53f1a3def\"}":{}}},"f:rules":{}}}]},"rules":[{"verbs":["*"],"apiGroups":["cluster.x-k8s.io"],"resources":["machines"],"resourceNames":["custom-7808e68fb38f","custom-93d19d48b5b8","custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6"]}]}]
2022/09/08 08:58:30 [DEBUG] DesiredSet - Updated rbac.authorization.k8s.io/v1, Kind=Role fleet-default/crt-<REDACTED>-nodes-manage for auth-prov-v2-roletemplate-<REDACTED> nodes-manage -- application/strategic-merge-patch+json {"metadata":{"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRTY/bIBD9K9UcK5Oa4NjYUk899FCph9WqlyqHAYYNXQwW4LTSKv+9IrtVrK36cYMH8+Z9PMFMBQ0WhOkJMIRYsLgYcr1G9Y10yVR2ycWdxlI87Vx85wxMgGs5sSXFMzvvWYqeCs2Lx0JMG0ar5mypx5ZD80ei+D1QYg/nR5hgxoAPNFMomw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQS94vuDGMS+l0O3Jz0oGgY96rFFtKq1qA2146Gv214U61Rep8deudjqtJ6oMEMWV1+qw+rkjiwlCpoyTF+fABf3hVJ2McAEtS5Xzy48bFOuHT26UGv94NdcKMFN029trtf+ueJCKdsx3fKedQcpmLIG2SDwICxHYcjC5XhpIK3+JuZjiutSb6CfN+1+sEeZdy7CsYFEOa5J0+fq8vppzSXOjFNr0XDkUiA0v1CSqDk/jGPf9zdUKKV6MnLUvb2hfTsMrUBOYlQ3tBOi10PX7Tu7YRhkK6mXVgm5YRiF4aPppDooudV61TmjPrlAuT6cKakr+BaOl+PlZwAAAP//1WFOc2MDAAA"}},"rules":[{"apiGroups":["cluster.x-k8s.io"],"resourceNames":["custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6","custom-7808e68fb38f","custom-93d19d48b5b8"],"resources":["machines"],"verbs":["*"]}]}
2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) /v1, Kind=ServiceAccount fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f
2022/09/08 08:58:30 [DEBUG] [plansecret] reconciling secret fleet-default/custom-7808e68fb38f-machine-plan
2022/09/08 08:58:30 [DEBUG] [plansecret] fleet-default/custom-7808e68fb38f-machine-plan: rv: 32202835: Reconciling machine PlanApplied condition to nil
2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) /v1, Kind=Secret fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f
2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) rbac.authorization.k8s.io/v1, Kind=Role fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f
2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) rbac.authorization.k8s.io/v1, Kind=RoleBinding fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f
2022/09/08 08:58:30 [DEBUG] [CAPI] Reconciling
2022/09/08 08:58:30 [DEBUG] [CAPI] Cluster still exists
2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete cluster.x-k8s.io/v1beta1, Kind=Cluster fleet-default/<REDACTED> for rke-cluster fleet-default/<REDACTED>
2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete rke.cattle.io/v1, Kind=RKEControlPlane fleet-default/<REDACTED> for rke-cluster fleet-default/<REDACTED>
2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete rke.cattle.io/v1, Kind=RKECluster fleet-default/<REDACTED> for rke-cluster fleet-default/<REDACTED>
2022/09/08 08:58:30 [DEBUG] [rkecontrolplane] (fleet-default/<REDACTED>) Peforming removal of rkecontrolplane
2022/09/08 08:58:30 [DEBUG] [rkecontrolplane] (fleet-default/<REDACTED>) listed 3 machines during removal
2022/09/08 08:58:30 [DEBUG] [UnmanagedMachine] Removing machine fleet-default/custom-607703a1e39b in cluster <REDACTED>
2022/09/08 08:58:30 [DEBUG] [UnmanagedMachine] Safe removal for machine fleet-default/custom-607703a1e39b in cluster <REDACTED> not necessary as it is not an etcd node
On the RKE2 bootstrap node, the rke2-server logs show the following:
Sep 07 11:41:57 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:41:57Z" level=info msg="Removing name=ip-yyy-yy-yy-yyy.eu-central-1.compute.internal-ee7ac07c id=1846382134098187668 address=172.28.74.196 from etcd"
Sep 07 11:41:57 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:41:57Z" level=info msg="Removing name=ip-zzz-zz-zz-zzz.eu-central-1.compute.internal-bc3f1edb id=12710303601531451479 address=172.28.70.189 from etcd"
Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Stopped tunnel to zzz.zz.zz.zzz:9345"
Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Stopped tunnel to yyy.yy.yy.yyy:9345"
Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Proxy done" err="context canceled" url="wss://yyy.yy.yy.yyy:9345/v1-rke2/connect"
Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Proxy done" err="context canceled" url="wss://zzz.zz.zz.zzz:9345/v1-rke2/connect"
To Reproduce
Unfortunately I can't reproduce this reliably, but it happens very often. Steps I am using to reproduce this issue:
- provision an RKE2 cluster with Terraform (see the resource sketch above)
- modify additional_manifest for the RKE2 cluster
- apply the change
Result
Occasionally the managed cluster gets deleted by Rancher.
Expected Result
The change is applied and the cluster is not deleted.
I ran tests that make exactly the same change (modify additional_manifest) while bypassing Terraform and calling the Rancher API directly, and that never caused a cluster deletion over 2k+ iterations. When using the Terraform provider it sometimes takes up to 10 attempts to reproduce this issue.
I am happy to provide any other info to investigate this further. This is causing massive outages for my clusters as they are just getting destroyed.
About this issue
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 54 (9 by maintainers)
Just to expand on the reasoning for the move to Q3: the issue should no longer be reproducible in the about-to-be-released Rancher 2.7.5 due to https://github.com/rancher/rancher/issues/41887 being fixed there.
This issue is now specifically about fixing the TFP side so that it no longer clears finalizers, as it is still not ideal that it does that. The priority is much lower though, given this "data loss" issue should no longer be happening on Rancher 2.7.5+.
Per @jakefhyde, to fix this we will need to switch from the currently used Norman (v3) APIs to the Steve (v1) APIs - or even to native k8s APIs. Both are pretty big undertakings, so this may take a while to address, especially since the immediate issue is now addressed in rancher/rancher since 2.7.5+.
@jakefhyde sure,
This was happening even without backup/restore, but after a backup/restore it was much easier to reproduce, so I am not sure if that is related.
I will try to reproduce this with the latest Rancher and the latest TF provider.
@riuvshyn, thank you for confirming this. We will however keep this issue open, as we want to address the TF provider removing finalizers as well.
I can confirm that I can not reproduce it anymore on 2.7.5! 🥳 🥳 🥳 🥳 🥳 cc @snasovich @jakefhyde @Oats87 Thank you very much!
@Oats87 Thank you for looking into this. I will investigate this issue.
Did some digging into this, and it looks like what is happening is that terraform is clearing the finalizers on the provisioning.cattle.io object. Unfortunately, our generating controller will run an empty apply (deletion) if this is the case: https://github.com/rancher/rancher/blob/a05de31fccb10059447c169f28dcc2068982a6f0/pkg/controllers/provisioningv2/provisioningcluster/controller.go#L289-L292
This is a bug caused by problems in multiple components, and while we can resolve it in the codebase for Rancher, I have not deduced a good workaround for this issue at this point. This is likely going to cause problems in other parts of the provider as well; for example, during deletion I would expect that wiping finalizers can lead to orphaned objects.
@jakefhyde
Ok, got this reproduced on a fresh setup, even without backup/restore:
- rancher version: 2.7.3
- terraform provider: 3.0.0
- rke2 version: v1.24.9+rke2r2
- terraform: 1.0.8
It took 19 iterations to reproduce; the only change applied was a label. Here is an example of the rancher2_cluster_v2 resource, maybe that will help to reproduce it.
Rancher 2.7.2 and TF provider 1.25 - same here. Updated RKE, K3s clusters. RKE2 clusters destroyed and recreated.
Reproduced on 2.7.1. I also noticed that the issue is much easier to reproduce after performing a Rancher backup/restore, so it might somehow be related…
cc @Josh-Diamond @jakefhyde
@jakefhyde I have an update on this one: I believe it is the Rancher backup/restore operator causing this. Sometimes when I perform a Rancher restore operation I am hitting errors like this:
And that is happening, I believe, because the Rancher backup/restore operator is supposed to scale down Rancher before doing the actual restore, and it does that, but it doesn't wait for Rancher to actually be fully stopped and starts the restore right away. Since termination is not instant and the restore is already in progress, it corrupts the data somehow. After a restore completes with such errors, this bug can be reproduced: changing just a label via Terraform on a managed cluster causes the cluster deletion.
So maybe this has nothing to do with the Terraform provider actually…
@jakefhyde I think I finally figured out how to reproduce it… Recently, on one of the Rancher env setups, I wasn't able to reproduce this issue at all, which was very confusing because a few days before it was definitely happening there. Then I noticed that the cluster hosting Rancher had been re-provisioned and was in a kind of "fresh" state. I also checked that on the previous iteration of that cluster some Rancher backup/restore tests with the backup-restore-operator had been executed. So on that "fresh" Rancher setup I provisioned an RKE2 cluster the same way as described in this ticket, performed a backup and restore of Rancher, and then started the simple test (modify a cluster label and a label in the manifests defined in additional_manifest) again, and on the 3rd iteration my cluster got to the state where all nodes are deleted except this one; I guess it is stuck on some finalizer.
So steps to reproduce it:
- provision an RKE2 cluster the same way as described in this ticket (registering nodes via the nodeCommand)
- perform a Rancher backup and then a restore with prune: false
- via rancher2_cluster_v2, change cluster labels and anything in additional_manifest X times until the issue is reproduced.
Additional notes: when this issue happens and I destroy the broken cluster (it disappears from the Rancher UI) and then re-provision the same cluster, everything looks normal, but the issue can still be reproduced with that cluster just by modifying the cluster config via the Terraform provider.
I hope that will help you to reproduce this.
This seems to be outdated information. RKE2 provisioning has been GA for a long time… I believe it went GA with 2.6.3. Only K3s still seems to be tech preview.