rancher: [BUG] Fleet trying to uninstall monitoring and longhorn after upgrade from 2.6.6 to 2.6.7

Rancher Server Setup

  • Rancher version: 2.6.6 / 2.6.7
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: Local: v1.21.7+rke2r2, Downstream: v1.21.5+rke2r1 (imported RKE2)

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) Admin

Describe the bug

After upgrading Rancher from 2.6.6 to 2.6.7, which at first glance went without any problem, I noticed that somewhere around the time the cluster agent and fleet agent were updated (I'm not sure which), my Longhorn and Monitoring apps were being uninstalled… but not completely: they got stuck somewhere along the way, and all other services using Longhorn volumes went down as well…

Longhorn and Monitoring are Rancher charts and were installed with minimal customization.

The Git repo has this layout (the attached screenshot is not reproduced here):

longhorn-crd

defaultNamespace: rancher-logging
helm:
  releaseName: rancher-logging-crd
  repo: https://charts.rancher.io
  chart: rancher-logging-crd
  version: 100.0.1+up3.15.0

longhorn

defaultNamespace: longhorn-system
helm:
  repo: https://charts.rancher.io
  chart: longhorn
  releaseName: longhorn
  values:
    defaultSettings:
      createDefaultDiskLabeledNodes: true
      defaultDataPath: /longhorn-data/
      defaultReplicaCount: 2
      defaultDataLocality: best-effort
      replicaAutoBalance: best-effort
diff:
  comparePatches:
  - apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    operations:
    - {"op":"remove", "path":"/spec/hostIPC"}
    - {"op":"remove", "path":"/spec/hostNetwork"} 
  - apiVersion: v1
    kind: Service
    name: longhorn-frontend
    namespace: longhorn-system
    operations:
    - {"op":"remove", "path":"/spec/ports/0"}   

monitoring-crd

defaultNamespace: longhorn-system
helm:
  repo: https://charts.rancher.io
  chart: longhorn-crd
  releaseName: longhorn-crd

monitoring

defaultNamespace: cattle-monitoring-system
helm:
  releaseName: rancher-monitoring
  repo: https://charts.rancher.io
  chart: rancher-monitoring
  version: 100.1.0+up19.0.3

diff:
  comparePatches:
  - apiVersion: admissionregistration.k8s.io/v1beta1
    kind: MutatingWebhookConfiguration
    name: rancher-monitoring-admission
    operations:
    - {"op":"remove", "path":"/webhooks"}
  - apiVersion: admissionregistration.k8s.io/v1beta1
    kind: ValidatingWebhookConfiguration
    name: rancher-monitoring-admission
    jsonPointers:
    - "/webhooks"
  - apiVersion: admissionregistration.k8s.io/v1
    kind: MutatingWebhookConfiguration
    name: rancher-monitoring-admission
    operations:
    - {"op":"remove", "path":"/webhooks"}
  - apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    name: rancher-monitoring-admission
    jsonPointers:
    - "/webhooks"    
  - apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    operations:
    - {"op":"remove", "path":"/spec/hostIPC"}
    - {"op":"remove", "path":"/spec/hostNetwork"}
    - {"op":"remove", "path":"/spec/hostPID"}
    - {"op":"remove", "path":"/spec/privileged"}
    - {"op":"remove", "path":"/spec/readOnlyRootFilesystem"}
  - apiVersion: apps/v1
    kind: Deployment
    name: rancher-monitoring-grafana
    namespace: cattle-monitoring-system
    operations:
    - {"op":"remove", "path":"/spec/template/spec/containers/0/env/0/value"}
  - apiVersion: apps/v1
    kind: Deployment
    operations:
    - {"op":"remove", "path":"/spec/template/spec/hostNetwork"}
    - {"op":"remove", "path":"/spec/template/spec/nodeSelector"}
    - {"op":"remove", "path":"/spec/template/spec/priorityClassName"}
    - {"op":"remove", "path":"/spec/template/spec/tolerations"}
  - apiVersion: v1
    kind: ServiceAccount
    operations:
    - {"op":"remove", "path":"/imagePullSecrets"}

These configs are more or less taken from the examples at https://github.com/ibrokethecloud/core-bundles.
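
For illustration only (this is not part of the original report): bundle directories like the ones above are typically registered with Fleet through a GitRepo resource in the local cluster, with one spec.paths entry per directory that contains a fleet.yaml. A minimal sketch follows; the resource name, repo URL, branch, directory names, and target selector are placeholders:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: core-bundles            # placeholder name
  namespace: fleet-default      # GitRepos that target downstream clusters live here
spec:
  repo: https://example.com/your/gitrepo.git   # placeholder URL
  branch: main
  paths:                        # one entry per bundle directory containing a fleet.yaml
    - longhorn-crd
    - longhorn
    - monitoring-crd
    - monitoring
  targets:
    - clusterSelector: {}       # matches all downstream clusters; narrow as needed

The diff.comparePatches sections in the fleet.yaml files above only tell Fleet which fields to ignore when comparing the deployed resources against the desired state; they do not change what gets installed.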

I'm not sure whether this is a problem caused by the upgrade, a misconfiguration on my end, or a problem with Fleet…

This really sucks! The cleanup/aftermath is no fun, as Istio and Cert-Manager depend on Monitoring and everything else depends on Longhorn… 😦

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

Any chance of releasing a 2.6.7-patch1 that includes this fix?

PASS: Verified fixed in 2.6.8-rc1

Rancher upgrade from 2.6.6 to 2.6.8-rc1:

  1. Rancher updated to 2.6.8-rc1
  2. Fleet pods restarted in the local cluster
  3. Downstream cluster apps are present and running
  4. Fleet pods restarted on the downstream cluster
  5. New fleet image v0.3.11-rc1 pulled and container created

local cluster: 2022-08-27_14-05-23 (screenshot)

downstream cluster: 2022-08-27_14-02-59 (screenshot)

This is the third time I've had major problems with Fleet… fun stuff 😦 It seems far from production ready, and there is little to no feedback on the issues 😦

And at first glance, all the e2e tests only use a single paths entry.