rancher: [BUG] Fleet trying to uninstall monitoring and longhorn after upgrade from 2.6.6 to 2.6.7

Rancher Server Setup

  • Rancher version: 2.6.6 / 2.6.7
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: Local: v1.21.7+rke2r2, Downstream: v1.21.5+rke2r1 (imported RKE2)

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) Admin

Describe the bug

After upgrading Rancher from 2.6.6 to 2.6.7, which at first glance went without any problem, I noticed that somewhere around the time the cluster agent and fleet agent were updated (I'm not sure which), my Longhorn and Monitoring apps were being uninstalled… but not completely: they got stuck somewhere along the way, and all other services using Longhorn volumes went down as well…

Longhorn and Monitoring are Rancher charts and were installed with minimal customization.

The Git repo has this layout (the attached screenshot is not reproduced here):

longhorn-crd

defaultNamespace: rancher-logging
helm:
  releaseName: rancher-logging-crd
  repo: https://charts.rancher.io
  chart: rancher-logging-crd
  version: 100.0.1+up3.15.0

longhorn

defaultNamespace: longhorn-system
helm:
  repo: https://charts.rancher.io
  chart: longhorn
  releaseName: longhorn
  values:
    defaultSettings:
      createDefaultDiskLabeledNodes: true
      defaultDataPath: /longhorn-data/
      defaultReplicaCount: 2
      defaultDataLocality: best-effort
      replicaAutoBalance: best-effort
diff:
  comparePatches:
  - apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    operations:
    - {"op":"remove", "path":"/spec/hostIPC"}
    - {"op":"remove", "path":"/spec/hostNetwork"} 
  - apiVersion: v1
    kind: Service
    name: longhorn-frontend
    namespace: longhorn-system
    operations:
    - {"op":"remove", "path":"/spec/ports/0"}   

monitoring-crd

defaultNamespace: longhorn-system
helm:
  repo: https://charts.rancher.io
  chart: longhorn-crd
  releaseName: longhorn-crd

monitoring

defaultNamespace: cattle-monitoring-system
helm:
  releaseName: rancher-monitoring
  repo: https://charts.rancher.io
  chart: rancher-monitoring
  version: 100.1.0+up19.0.3

diff:
  comparePatches:
  - apiVersion: admissionregistration.k8s.io/v1beta1
    kind: MutatingWebhookConfiguration
    name: rancher-monitoring-admission
    operations:
    - {"op":"remove", "path":"/webhooks"}
  - apiVersion: admissionregistration.k8s.io/v1beta1
    kind: ValidatingWebhookConfiguration
    name: rancher-monitoring-admission
    jsonPointers:
    - "/webhooks"
  - apiVersion: admissionregistration.k8s.io/v1
    kind: MutatingWebhookConfiguration
    name: rancher-monitoring-admission
    operations:
    - {"op":"remove", "path":"/webhooks"}
  - apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    name: rancher-monitoring-admission
    jsonPointers:
    - "/webhooks"    
  - apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    operations:
    - {"op":"remove", "path":"/spec/hostIPC"}
    - {"op":"remove", "path":"/spec/hostNetwork"}
    - {"op":"remove", "path":"/spec/hostPID"}
    - {"op":"remove", "path":"/spec/privileged"}
    - {"op":"remove", "path":"/spec/readOnlyRootFilesystem"}
  - apiVersion: apps/v1
    kind: Deployment
    name: rancher-monitoring-grafana
    namespace: cattle-monitoring-system
    operations:
    - {"op":"remove", "path":"/spec/template/spec/containers/0/env/0/value"}
  - apiVersion: apps/v1
    kind: Deployment
    operations:
    - {"op":"remove", "path":"/spec/template/spec/hostNetwork"}
    - {"op":"remove", "path":"/spec/template/spec/nodeSelector"}
    - {"op":"remove", "path":"/spec/template/spec/priorityClassName"}
    - {"op":"remove", "path":"/spec/template/spec/tolerations"}
  - apiVersion: v1
    kind: ServiceAccount
    operations:
    - {"op":"remove", "path":"/imagePullSecrets"}

These configs are more or less taken from the examples at https://github.com/ibrokethecloud/core-bundles.
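
For illustration only (this is not part of the original report): bundle directories like the ones above are typically registered with Fleet through a GitRepo resource in the local cluster, with one spec.paths entry per directory that contains a fleet.yaml. A minimal sketch follows; the resource name, repo URL, branch, directory names, and target selector are placeholders:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: core-bundles            # placeholder name
  namespace: fleet-default      # GitRepos that target downstream clusters live here
spec:
  repo: https://example.com/your/gitrepo.git   # placeholder URL
  branch: main
  paths:                        # one entry per bundle directory containing a fleet.yaml
    - longhorn-crd
    - longhorn
    - monitoring-crd
    - monitoring
  targets:
    - clusterSelector: {}       # matches all downstream clusters; narrow as needed

The diff.comparePatches sections in the fleet.yaml files above only tell Fleet which fields to ignore when comparing the deployed resources against the desired state; they do not change what gets installed.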

I'm not sure whether this is a problem caused by the upgrade, a misconfiguration on my end, or a problem with Fleet…

This really sucks! The cleanup/aftermath is no fun, as Istio and Cert-Manager depend on Monitoring and everything else depends on Longhorn… 😦

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

Any chance of releasing a 2.6.7-patch1 that includes this fix?

PASS: Verified fixed in 2.6.8-rc1

Rancher upgrade from 2.6.6 to 2.6.8-rc1:

  1. Rancher updated to 2.6.8-rc1
  2. Fleet pods restarted in the local cluster
  3. Downstream cluster apps are present and running
  4. Fleet pods restarted on the downstream cluster
  5. New fleet image v0.3.11-rc1 pulled and container created

local cluster: 2022-08-27_14-05-23 (screenshot)

downstream cluster: 2022-08-27_14-02-59 (screenshot)

This is the third time I've had major problems with Fleet… fun stuff 😦 It seems far from production ready, and there is little to no feedback on the issues 😦

And at first glance, all the e2e tests only use a single paths entry.