rancher: [Backport v2.6] [BUG] Deleted cluster not removed from UI view until hard refresh (Ghost cluster)

Rancher 2.6.1

To reproduce:

  1. Log in as a standard, non-admin user
  2. Create an RKE2 cluster
  3. In a second browser, log in as an admin user and navigate to the Cluster Manager view
  4. As the non-admin user, delete the cluster

Expected:

When the cluster is deleted, it is removed from the view.

Actual:

Normal user session: the deleted cluster is not removed from the view until a hard browser refresh is performed.
Admin user session: the deleted cluster is removed from the view immediately.

https://user-images.githubusercontent.com/3813921/137932366-12f9f499-0e30-4c17-86b5-b361ba95c5a5.mov

Rancher server logs during the deletion:

```
2021/10/19 15:03:11 [INFO] rkecluster fleet-default/template-rke2: waiting for at least one bootstrap node
2021/10/19 15:03:11 [INFO] [mgmt-auth-crtb-controller] Deleting roleBinding crb-n4zbcfymqv
2021/10/19 15:03:11 [INFO] [mgmt-auth-crtb-controller] Deleting rolebinding creator-cluster-owner-cluster-owner in namespace p-l9xs6 for crtb creator-cluster-owner
2021/10/19 15:03:11 [INFO] [mgmt-auth-crtb-controller] Deleting rolebinding creator-cluster-owner-cluster-owner in namespace p-hhc6q for crtb creator-cluster-owner
2021/10/19 15:03:12 [INFO] [mgmt-project-rbac-remove] Deleting namespace p-l9xs6
2021/10/19 15:03:12 [INFO] [mgmt-project-rbac-remove] Deleting namespace p-hhc6q
2021/10/19 15:03:16 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: clusterregistrationtokens.management.cattle.io "default-token" is forbidden: unable to create new content in namespace c-m-v8p56lnd because it is being terminated, requeuing
2021/10/19 15:03:16 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: clusterregistrationtokens.management.cattle.io "default-token" is forbidden: unable to create new content in namespace c-m-v8p56lnd because it is being terminated, requeuing
2021/10/19 15:03:16 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: clusterregistrationtokens.management.cattle.io "default-token" is forbidden: unable to create new content in namespace c-m-v8p56lnd because it is being terminated, requeuing
2021/10/19 15:03:16 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: clusterregistrationtokens.management.cattle.io "default-token" is forbidden: unable to create new content in namespace c-m-v8p56lnd because it is being terminated, requeuing
2021/10/19 15:03:16 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: clusterregistrationtokens.management.cattle.io "default-token" is forbidden: unable to create new content in namespace c-m-v8p56lnd because it is being terminated, requeuing
2021/10/19 15:03:16 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: clusterregistrationtokens.management.cattle.io "default-token" is forbidden: unable to create new content in namespace c-m-v8p56lnd because it is being terminated, requeuing
2021/10/19 15:03:17 [ERROR] error syncing 'fleet-default/template-rke2-dsfsdfsdfsdf-7ccf7c9f-98fgb': handler machine-provision-remove: cannot delete machine template-rke2-dsfsdfsdfsdf-7ccf7c9f-98fgb because create job has not finished, requeuing
2021/10/19 15:03:17 [INFO] rkecluster fleet-default/template-rke2: waiting for at least one bootstrap node
2021/10/19 15:03:17 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: clusterregistrationtokens.management.cattle.io "default-token" is forbidden: unable to create new content in namespace c-m-v8p56lnd because it is being terminated, requeuing
2021/10/19 15:03:17 [INFO] [mgmt-auth-prtb-controller] Updating owner label for roleBinding crb-mrk3zsur4y
2021/10/19 15:03:17 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: clusterregistrationtokens.management.cattle.io "default-token" is forbidden: unable to create new content in namespace c-m-v8p56lnd because it is being terminated, requeuing
2021/10/19 15:03:18 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: clusterregistrationtokens.management.cattle.io "default-token" is forbidden: unable to create new content in namespace c-m-v8p56lnd because it is being terminated, requeuing
2021/10/19 15:03:18 [INFO] [mgmt-auth-prtb-controller] Deleting roleBinding crb-mrk3zsur4y
2021/10/19 15:03:19 [INFO] [mgmt-cluster-rbac-delete] Creating namespace c-m-v8p56lnd
2021/10/19 15:03:19 [ERROR] error syncing 'c-m-v8p56lnd': handler cluster-watch: namespaces "c-m-v8p56lnd" not found, requeuing
2021/10/19 15:03:20 [ERROR] error syncing 'harvey': handler machine-worker-label: machines.cluster.x-k8s.io "custom-599f50dd418f" not found, requeuing
```

Most upvoted comments

Validation Template

This template is for the second version of the fix. Information here supersedes what was in the previous validation template.

Root Cause

Users are assigned roles/roleBindings (for provisioning clusters) and clusterRoles/clusterRoleBindings (for management clusters) which give them permission to view the cluster object in the local cluster. These roles are owned (in part) by the cluster object and are deleted when the cluster is deleted. In the current version of Rancher, this results in these RBAC objects being removed before the cluster is completely gone, so affected users never receive the delete event for the cluster through the websocket, and the cluster remains in the UI. Admin users, who have * verbs on * resources in * groups in the local cluster, do not depend on these objects, which explains the discrepancy pointed out in the issue.

The previous fix was insufficient for RKE2 clusters because v2 provisioning clusters use a different CRD as their primary cluster object (clusters.provisioning.cattle.io). These cluster types were not checked, so the roles granting access to them still exhibited the issues outlined above.
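
To make the ownership relationship concrete, here is a hypothetical example (as a client-go object literal in Go) of the kind of clusterRoleBinding that grants a standard user visibility into a management cluster. The names, the role reference, and the annotation usage are illustrative assumptions rather than Rancher's real objects; only the cluster ID is taken from the logs above.

```go
package sketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative only: a binding that lets user "u-abc123" (hypothetical) view
// the management cluster object "c-m-v8p56lnd" (from the logs above). Because
// objects like this are tied to the cluster's lifecycle, deleting them before
// the cluster object finishes terminating revokes the user's watch, and the
// cluster's delete event never reaches their UI session.
var exampleBinding = &rbacv1.ClusterRoleBinding{
	ObjectMeta: metav1.ObjectMeta{
		Name: "crb-example", // hypothetical name
		Annotations: map[string]string{
			"cluster.cattle.io/name": "c-m-v8p56lnd",
		},
	},
	RoleRef: rbacv1.RoleRef{
		APIGroup: "rbac.authorization.k8s.io",
		Kind:     "ClusterRole",
		Name:     "c-m-v8p56lnd-cluster-owner", // hypothetical role name
	},
	Subjects: []rbacv1.Subject{{
		APIGroup: "rbac.authorization.k8s.io",
		Kind:     "User",
		Name:     "u-abc123",
	}},
}
```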

What was fixed, or what changes have occurred

Logic has been added which blocks roles, role bindings, cluster roles, and cluster role bindings that grant access to a cluster from being deleted as long as the cluster is still in a deleting state (i.e. it can be retrieved from k8s and has a non-nil deletion timestamp).

The previous version of this change only considered management-type clusters (clusters.management.cattle.io). This new change also considers provisioning-type clusters (clusters.provisioning.cattle.io) and blocks deletion of the RBAC objects outlined above so long as clusters of either type are in a deleting state, as sketched below.
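
As a rough sketch of this gating (assuming a client-go dynamic client; this is not Rancher's actual code), the check that decides whether an RBAC object may finally be released could look like the following:

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// The two cluster CRDs named in this issue. Management clusters are
// cluster-scoped; v2 provisioning clusters live in a namespace such as
// fleet-default.
var (
	mgmtClusterGVR = schema.GroupVersionResource{Group: "management.cattle.io", Version: "v3", Resource: "clusters"}
	provClusterGVR = schema.GroupVersionResource{Group: "provisioning.cattle.io", Version: "v1", Resource: "clusters"}
)

// clusterStillDeleting reports whether the referenced cluster object can
// still be retrieved and carries a non-nil deletion timestamp. A finalizer
// would call this for both GVRs above and keep the RBAC object alive while
// either returns true, so the user's watch stays authorized and the delete
// event reaches the UI.
func clusterStillDeleting(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string) (bool, error) {
	var obj *unstructured.Unstructured
	var err error
	if namespace == "" {
		obj, err = client.Resource(gvr).Get(ctx, name, metav1.GetOptions{})
	} else {
		obj, err = client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	}
	if apierrors.IsNotFound(err) {
		return false, nil // cluster fully gone: the RBAC object may be deleted
	}
	if err != nil {
		return false, err
	}
	return obj.GetDeletionTimestamp() != nil, nil
}
```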

Areas or cases that should be tested

Notes

  • The functionality below relies on the websocket remaining active throughout the cluster deletion process. If the websocket is interrupted during deletion, the cluster will remain on the UI. That issue is not related to our RBAC or this fix, and any attempt to improve our websocket stability will need to happen in a separate PR.
  • The changes made in this PR affect the ability of roles, cluster roles, role bindings, and cluster role bindings to be deleted. In every test case, validate that the RBAC objects could be deleted and that we didn't end up creating new "orphaned" objects. See the section on regressions for more details.

Test Cases

  1. The original issue with RKE1 and RKE2 clusters. These steps should be re-run with a new user that has a variety of permissions in the downstream cluster (both cluster level like cluster-owner, cluster-member, view nodes, etc. and project level like project owner, manage project members, etc.). General Steps:
  • Run rancher
  • Create a new user with standard user global role
  • Create a new cluster, as admin
  • Add the user to the cluster as a cluster owner
  • Delete the cluster. Observe the UI to ensure that the cluster is removed from the window of the standard user and the admin user.
  • Verify that no RBAC resources exist in the local cluster which give permission to view this cluster (i.e. no roles/clusterroles granting any verbs specifically on this cluster, and no rolebindings/clusterrolebindings to such roles/cluster roles, including bindings to roles/cluster roles which would have granted such access but no longer exist). One way to automate this check is sketched after this list.
  2. Removal of a user from a cluster on RKE1/RKE2 clusters. Steps:
  • Run rancher
  • Create a new user with standard user global role
  • Create a new cluster, as admin
  • Add the user to the cluster as a cluster owner
  • Remove the user from the cluster owner role
  • Validate that the user cannot see the cluster any longer
  • Validate that no RBAC resources exist in the local cluster which give users access to the management or provisioning cluster objects for this cluster.
  3. Removal of a user from a project on RKE1/RKE2 clusters. Steps:
  • Run rancher
  • Create a new user with standard user global role
  • Create a new cluster, as admin
  • In the cluster, create a new project
  • Add the new user as a project owner to this cluster
  • Validate that the new user can see the cluster in the UI
  • Remove the user from the project owner role
  • Validate that the user cannot see the cluster any longer
  • Validate that no RBAC resources exist in the local cluster which give users access to the management or provisioning cluster objects for this cluster.
  4. Local cluster basic RBAC test. Steps:
  • Run rancher.
  • In the local cluster, create a role
  • In the local cluster, bind a user to that role through a role binding
  • Delete the role binding through kubectl
  • Delete the role through kubectl
  • Verify that the role was deleted
  • Verify that the role binding was deleted
  • In the local cluster, create a cluster role
  • In the local cluster, bind a user to that cluster role through a cluster role binding
  • In the local cluster, bind a user to that cluster role through a role binding
  • Delete the role binding through kubectl
  • Delete the cluster role binding through kubectl
  • Delete the cluster role through kubectl
  • Verify that the role binding was deleted
  • Verify that the cluster role binding was deleted
  • Verify that the cluster role was deleted
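
For the "verify that no RBAC resources exist" steps in test cases 1-3, one hypothetical way to automate the check is to scan the local cluster for RBAC objects still annotated with the deleted cluster's name. This assumes the cluster.cattle.io/name annotation discussed in the regressions section below is present on cluster-derived objects; it is a convenience sketch, not an official tool.

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const clusterNameAnno = "cluster.cattle.io/name" // annotation key from the regressions section

// leftoverClusterRBAC returns the cluster roles and cluster role bindings in
// the local cluster that are still annotated with the deleted cluster's name.
// After a clean deletion this slice should be empty. Namespaced Roles and
// RoleBindings can be scanned the same way via cs.RbacV1().Roles(ns) and
// cs.RbacV1().RoleBindings(ns).
func leftoverClusterRBAC(ctx context.Context, cs kubernetes.Interface, clusterName string) ([]string, error) {
	var leftovers []string

	crs, err := cs.RbacV1().ClusterRoles().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, cr := range crs.Items {
		if cr.Annotations[clusterNameAnno] == clusterName {
			leftovers = append(leftovers, fmt.Sprintf("clusterrole/%s", cr.Name))
		}
	}

	crbs, err := cs.RbacV1().ClusterRoleBindings().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for _, crb := range crbs.Items {
		if crb.Annotations[clusterNameAnno] == clusterName {
			leftovers = append(leftovers, fmt.Sprintf("clusterrolebinding/%s", crb.Name))
		}
	}
	return leftovers, nil
}
```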

What areas could experience regressions

  • The ability to remove users from cluster-level permissions. Since this code adds new finalizers to our RBAC objects, it could result in rolebindings/clusterrolebindings not being deleted, leaving users with more permissions than desired. Note that since this affects the RBAC primitives, not our ClusterRoleTemplateBindings or ProjectRoleTemplateBindings, you will need to verify that the RBAC primitives themselves are deleted; it is not enough to verify that the CRTB or PRTB is deleted. In addition, keep in mind that some permissions are granted to users who have roles in a project within a cluster.

  • The ability to delete RBAC objects in the local cluster. This is related to the first bullet point, but at a lower level. The way the solution is structured results in a new finalizer on every Role, Cluster Role, Role Binding, and Cluster Role Binding in the local cluster. The code is configured, essentially, to ignore resources which do not have a cluster.cattle.io/name annotation (and a cluster.cattle.io/namespace annotation in the case of some RBAC resources). But it's possible that a bug causes the finalizer to be overly active, so tests should confirm that basic create/delete of the above resources continues to work without issue in the local cluster. A sketch of the guard follows below.
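
A minimal sketch of that guard, under the assumption that the two annotations named above are the scoping mechanism:

```go
package sketch

// managedByCluster reports whether an RBAC object belongs to a cluster and
// therefore should be handled by the finalizer at all. The annotation keys
// come from this issue; the helper itself is illustrative. Anything this
// returns false for, the finalizer must leave alone.
func managedByCluster(annotations map[string]string, namespaced bool) bool {
	if annotations["cluster.cattle.io/name"] == "" {
		return false
	}
	// Some RBAC resources are additionally scoped by a cluster namespace.
	if namespaced && annotations["cluster.cattle.io/namespace"] == "" {
		return false
	}
	return true
}
```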

Are the repro steps accurate/minimal?

Yes, they are included here for convenience.

  • Run rancher v2.6.8
  • Create a new user with standard user global role
  • Create a new cluster, as admin
  • Add the user to the cluster as a cluster owner
  • Open a new window where you can see the existing cluster as the standard user (in Cluster Manager). Place that side by side with the admin user (also in Cluster Manager).
  • Initiate a delete of the cluster
  • Note that the admin user no longer sees the cluster, but the non-admin user continues to see it (with the fix in place, this should no longer happen)