rancher: kubernetes 1.22 conflicts with system-library-rancher-monitoring, cluster is in error state

Rancher Server Setup

  • Rancher version: v2.6.3-patch2
  • Installation option (Docker install/Helm Chart): Docker

Information about the Cluster

  • Kubernetes version: v1.22.6-gke.300
  • Cluster Type (Local/Downstream): Hosted, gke

User Information

  • What is the role of the user logged in? Admin

Describe the bug: My 1.22.6-gke.300 cluster has been in an error state since upgrading from Rancher v2.5 to v2.6, stating "Template system-library-rancher-monitoring incompatible with rancher version or cluster's [c-ab123] kubernetes version". Legacy monitoring is disabled on all projects and on the cluster; alerts, alert groups, and notifiers are all deleted; the cattle-prometheus namespace is deleted. The system-library chart is on branch release-v2.6.

Rancher logs show:

[ERROR] error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster c-ab123 system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [c-ab123] kubernetes version, requeuing

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 31
  • Comments: 85 (4 by maintainers)

Most upvoted comments

We had the same problem and found a workaround until someone at Rancher hopefully fixes this:

In the local cluster where rancher is running:

  1. Make a backup of the catalogtemplates/system-library-rancher-monitoring resource: kubectl get catalogtemplates system-library-rancher-monitoring -n cattle-global-data -o yaml > system-library-rancher-monitoring.yaml
  2. Edit the catalogtemplates/system-library-rancher-monitoring resource: kubectl edit catalogtemplates system-library-rancher-monitoring -n cattle-global-data

In the first list item under “spec.versions”, change “kubeVersion: < 1.22.0-0” to something that matches your Kubernetes version. We have set “kubeVersion: ‘>=1.21.0-0’”.

Before

apiVersion: management.cattle.io/v3
kind: CatalogTemplate
metadata:
  creationTimestamp: "2021-08-05T07:15:46Z"
  generation: 4
  labels:
    catalog.cattle.io/name: system-library
  name: system-library-rancher-monitoring
  namespace: cattle-global-data
  resourceVersion: "183697747"
  uid: b68405eb-3973-4baa-b887-6f973ad3dc61
spec:
  catalogId: system-library
  defaultVersion: 0.3.2
  description: Provides monitoring for Kubernetes which is maintained by Rancher 2.
  displayName: rancher-monitoring
  folderName: rancher-monitoring
  icon: https://coreos.com/sites/default/files/inline-images/Overview-prometheus_0.png
  projectURL: https://github.com/coreos/prometheus-operator
  versions:
  - digest: 08fbaee28d5a0efb79db02d9372629e2
    externalId: catalog://?catalog=system-library&template=rancher-monitoring&version=0.3.2
    kubeVersion: < 1.22.0-0
    rancherMinVersion: 2.6.1-alpha1
    version: 0.3.2
    versionDir: charts/rancher-monitoring/v0.3.2
    versionName: rancher-monitoring
...

After

apiVersion: management.cattle.io/v3
kind: CatalogTemplate
metadata:
  creationTimestamp: "2021-08-05T07:15:46Z"
  generation: 4
  labels:
    catalog.cattle.io/name: system-library
  name: system-library-rancher-monitoring
  namespace: cattle-global-data
  resourceVersion: "183697747"
  uid: b68405eb-3973-4baa-b887-6f973ad3dc61
spec:
  catalogId: system-library
  defaultVersion: 0.3.2
  description: Provides monitoring for Kubernetes which is maintained by Rancher 2.
  displayName: rancher-monitoring
  folderName: rancher-monitoring
  icon: https://coreos.com/sites/default/files/inline-images/Overview-prometheus_0.png
  projectURL: https://github.com/coreos/prometheus-operator
  versions:
  - digest: 08fbaee28d5a0efb79db02d9372629e2
    externalId: catalog://?catalog=system-library&template=rancher-monitoring&version=0.3.2
    kubeVersion: '>=1.21.0-0'
    rancherMinVersion: 2.6.1-alpha1
    version: 0.3.2
    versionDir: charts/rancher-monitoring/v0.3.2
    versionName: rancher-monitoring
...

After editing this resource the error messages stopped immediately.
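If you prefer a non-interactive change, the same edit can also be applied with kubectl patch. This is a minimal sketch, assuming the 0.3.2 entry is the first element of spec.versions (index 0); adjust the index and the version range for your setup:

kubectl patch catalogtemplates system-library-rancher-monitoring -n cattle-global-data \
  --type=json -p '[{"op": "replace", "path": "/spec/versions/0/kubeVersion", "value": ">=1.21.0-0"}]'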

To fix the issue with the RKE cluster, follow these steps:

Use kubectl to edit the cluster configuration:

kubectl edit clusters.management.cattle.io <cluster_id>

Replace <cluster_id> with the ID of the cluster you want to edit.

Find the section that contains the error message:

- lastUpdateTime: "2023-03-28T03:29:01Z"
  message: template system-library-rancher-monitoring incompatible with rancher
    version or cluster's [c-qmh8k] kubernetes version
  reason: Error
  status: "False"
  type: PrometheusOperatorDeployed

Replace it with the following section:

- lastUpdateTime: "2023-03-28T03:29:01Z"
  status: "True"
  type: PrometheusOperatorDeployed

Save the changes and exit the editor.
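To verify the edit (or to inspect the failing condition before you change it), you can query the condition directly; a sketch, with <cluster_id> as a placeholder:

kubectl get clusters.management.cattle.io <cluster_id> \
  -o jsonpath='{.status.conditions[?(@.type=="PrometheusOperatorDeployed")]}'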

This should not have been closed. One of the problems is that a message is being logged over and over about something that we don’t care about, precisely because it isn’t supported. It’s still happening on my Rancher 2.6.9 instance. This started its life on 2.3.x and never had monitoring V1 enabled. The other problem has a separate ticket as noted above.

I was able to fix the same issue by disabling the legacy feature and restarting Rancher. The option is located in Global settings -> Feature flags. After disabling it, the system-library-rancher-monitoring errors continue to populate the logs until Rancher is restarted.
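If you prefer the CLI over the UI, the flag can also be toggled in the local cluster; a sketch, assuming the flag is exposed as the features.management.cattle.io resource named legacy with a spec.value boolean:

kubectl patch features.management.cattle.io legacy --type=merge -p '{"spec":{"value":false}}'

Then restart Rancher as described above (how you restart depends on whether it is a Docker or Helm install).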

Have exactly the same problem when upgrading the Docker version of Rancher from 2.5.12 to 2.6.4:

2022/04/07 07:06:01 [ERROR] error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster local system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [local] kubernetes version, requeuing
W0407 07:06:53.762722 56 transport.go:288] Unable to cancel request for *client.addQuery
2022/04/07 07:08:02 [ERROR] error syncing 'system-library': handler system-image-upgrade-catalog-controller: upgrade cluster local system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [local] kubernetes version, requeuing

things are unchanged on v2.6.4

Dear @MKlimuszka, please consider reopening the case. The original issue report clearly states that legacy monitoring is disabled on all projects and on the cluster, and that alerts, alert groups, and notifiers are all deleted. So the issue is not that Monitoring and Logging v1 are no longer supported above 1.21, but that an error is still logged on the dashboard even though they are all turned off.

How exactly do you clear the error in the UI after it has triggered? All the solutions I see so far only disable the error being logged. With the error triggered, the cluster is not actually selectable, although if you paste the ID into the URL you can still access Cluster Explorer. The cluster in question is not even using the legacy Prometheus monitoring. Running Rancher 2.7.1, and this cluster was upgraded to 1.24.10.

UPDATE: I was able to clear the error by editing the cluster.

kubectl edit clusters.management.cattle.io c-z7kd2

Find the block for PrometheusOperatorDeployed. It will be in an Error state. Replace that block with something like the following to reset the error in the UI.

  - lastUpdateTime: "2023-04-09T09:05:55Z"
    status: "True"
    type: PrometheusOperatorDeployed

Hit the same issue. Since we don't need the legacy catalog feature, I simply removed it, and there are no errors any more:

kubectl get catalogs system-library -o yaml > system-library.yaml
kubectl delete -f system-library.yaml
kubectl rollout restart -n cattle-system deploy/rancher
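If you need the legacy catalog back later, the backup taken in the first command should restore it (assuming the saved YAML is still compatible with your Rancher version):

kubectl apply -f system-library.yaml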

same issue in 2.7 also

@onpaws we are using single instance docker image

I was just hit by this bug. When I attempted the fixes documented in the issue, I had a hard time locating the right resource to edit until I realized I needed to run kubectl edit clusters.management.cattle.io <cluster-id> against the "local" cluster. I hope this helps someone else who is just a newbie with Rancher as well.
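For example, with a kubeconfig context pointing at the Rancher management ("local") cluster (the context name and cluster ID below are placeholders):

kubectl config use-context local
kubectl edit clusters.management.cattle.io c-ab123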

Thanks @chri4774 for the tip, the logs are indeed gone with this tweak, but I still have the error state on my 1.22 cluster. Were you able to get that to go away as well?