rancher: [BUG] Fleet trying to uninstall monitoring and longhorn after upgrade from 2.6.6 to 2.6.7
**Rancher Server Setup**
- Rancher version: 2.6.6 / 2.6.7
- Installation option (Docker install/Helm Chart):
- If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2
- Proxy/Cert Details:
**Information about the Cluster**
- Kubernetes version: local: v1.21.7+rke2r2, downstream: v1.21.5+rke2r1 (imported RKE2)
**User Information**
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) Admin
**Describe the bug**
After upgrading Rancher from 2.6.6 to 2.6.7, which at first glance went without any problems, I noticed that around the time the cluster agent and fleet agent were updated (not sure which one), my Longhorn and Monitoring apps were being uninstalled… But not completely: they got stuck somewhere along the way, and all other services using Longhorn volumes went down as well…
Longhorn and Monitoring are Rancher charts, installed with minimal customization. The Git repo has the following layout:
**longhorn-crd**
```yaml
defaultNamespace: rancher-logging
helm:
  releaseName: rancher-logging-crd
  repo: https://charts.rancher.io
  chart: rancher-logging-crd
  version: 100.0.1+up3.15.0
```
**longhorn**
```yaml
defaultNamespace: longhorn-system
helm:
  repo: https://charts.rancher.io
  chart: longhorn
  releaseName: longhorn
  values:
    defaultSettings:
      createDefaultDiskLabeledNodes: true
      defaultDataPath: /longhorn-data/
      defaultReplicaCount: 2
      defaultDataLocality: best-effort
      replicaAutoBalance: best-effort
diff:
  comparePatches:
    - apiVersion: policy/v1beta1
      kind: PodSecurityPolicy
      operations:
        - {"op": "remove", "path": "/spec/hostIPC"}
        - {"op": "remove", "path": "/spec/hostNetwork"}
    - apiVersion: v1
      kind: Service
      name: longhorn-frontend
      namespace: longhorn-system
      operations:
        - {"op": "remove", "path": "/spec/ports/0"}
```
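As I understand it, the `operations` entries above are standard RFC 6902 JSON Patch operations that tell Fleet to drop those fields before comparing desired and live state, so they are ignored when detecting drift. A minimal pure-Python sketch (an illustrative helper, not Fleet's actual code) of what a `remove` at `/spec/ports/0` does:

```python
def json_patch_remove(doc, pointer):
    """Apply an RFC 6902 'remove' op given a JSON Pointer like /spec/ports/0."""
    # Split the pointer into parts and undo JSON Pointer escaping (~1 -> /, ~0 -> ~).
    parts = [p.replace("~1", "/").replace("~0", "~")
             for p in pointer.lstrip("/").split("/")]
    target = doc
    for part in parts[:-1]:
        # Numeric parts index into lists; everything else is a dict key.
        target = target[int(part)] if isinstance(target, list) else target[part]
    last = parts[-1]
    if isinstance(target, list):
        target.pop(int(last))
    else:
        del target[last]
    return doc

# The longhorn-frontend patch above: ignore the first service port.
svc = {"spec": {"ports": [{"port": 80}, {"port": 443}]}}
json_patch_remove(svc, "/spec/ports/0")
# svc["spec"]["ports"] is now [{"port": 443}]
```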
**monitoring-crd**
```yaml
defaultNamespace: longhorn-system
helm:
  repo: https://charts.rancher.io
  chart: longhorn-crd
  releaseName: longhorn-crd
```
**monitoring**
```yaml
defaultNamespace: cattle-monitoring-system
helm:
  releaseName: rancher-monitoring
  repo: https://charts.rancher.io
  chart: rancher-monitoring
  version: 100.1.0+up19.0.3
diff:
  comparePatches:
    - apiVersion: admissionregistration.k8s.io/v1beta1
      kind: MutatingWebhookConfiguration
      name: rancher-monitoring-admission
      operations:
        - {"op": "remove", "path": "/webhooks"}
    - apiVersion: admissionregistration.k8s.io/v1beta1
      kind: ValidatingWebhookConfiguration
      name: rancher-monitoring-admission
      jsonPointers:
        - "/webhooks"
    - apiVersion: admissionregistration.k8s.io/v1
      kind: MutatingWebhookConfiguration
      name: rancher-monitoring-admission
      operations:
        - {"op": "remove", "path": "/webhooks"}
    - apiVersion: admissionregistration.k8s.io/v1
      kind: ValidatingWebhookConfiguration
      name: rancher-monitoring-admission
      jsonPointers:
        - "/webhooks"
    - apiVersion: policy/v1beta1
      kind: PodSecurityPolicy
      operations:
        - {"op": "remove", "path": "/spec/hostIPC"}
        - {"op": "remove", "path": "/spec/hostNetwork"}
        - {"op": "remove", "path": "/spec/hostPID"}
        - {"op": "remove", "path": "/spec/privileged"}
        - {"op": "remove", "path": "/spec/readOnlyRootFilesystem"}
    - apiVersion: apps/v1
      kind: Deployment
      name: rancher-monitoring-grafana
      namespace: cattle-monitoring-system
      operations:
        - {"op": "remove", "path": "/spec/template/spec/containers/0/env/0/value"}
    - apiVersion: apps/v1
      kind: Deployment
      operations:
        - {"op": "remove", "path": "/spec/template/spec/hostNetwork"}
        - {"op": "remove", "path": "/spec/template/spec/nodeSelector"}
        - {"op": "remove", "path": "/spec/template/spec/priorityClassName"}
        - {"op": "remove", "path": "/spec/template/spec/tolerations"}
    - apiVersion: v1
      kind: ServiceAccount
      operations:
        - {"op": "remove", "path": "/imagePullSecrets"}
```
These configs are more or less taken from these examples: https://github.com/ibrokethecloud/core-bundles
I am not sure whether this is caused by the upgrade, a misconfiguration on my end, or a bug in Fleet…
This really sucks! The cleanup/aftermath is no fun, as Istio and Cert-Manager depend on Monitoring, and everything else on Longhorn… 😦
**About this issue**
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (2 by maintainers)
> Any chance to release 2.6.7-patch1 including this fix?
**PASS — Verified Fixed**
- Verified on: 2.6.8-rc1
- Rancher upgrade from: 2.6.6 → 2.6.8-rc1
- Fleet: v0.3.11-rc1
- Images pulled, container created
- local cluster:
- downstream cluster:
This is the third time I have had major problems with Fleet… fun stuff 😦 It seems far from production ready, and there is little to no feedback on the issues 😦
And at first glance, all e2e tests only have a single `paths` entry.
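For reference, this is roughly what a GitRepo resource with multiple `paths` entries looks like, matching the repo layout above (a sketch only; the repo URL, branch, and names are assumptions, not taken from this issue):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: apps                 # hypothetical name
  namespace: fleet-default
spec:
  repo: https://example.com/org/fleet-repo   # placeholder URL
  branch: main
  paths:
    - longhorn-crd
    - longhorn
    - monitoring-crd
    - monitoring
```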