rancher: 2.4.3 upgrade monitoring to 0.1.0 fails

Just upgraded Rancher to 2.4.3 from 2.3.6. The clusters I have were previously running k8s 1.17.4 and I have upgraded them to 1.17.5. Rancher indicated there is a monitoring upgrade to 0.1.0. I was previously running 0.0.7 without any issues. However, when I proceeded with the upgrade, the prometheus-cluster-monitoring statefulset never comes up and is constantly created and then removed. It’s not quite clear what the issue is. All the sidecars in the pod come up healthy except prometheus-agent. I’m running CentOS 7 VMs and Docker 19.3.1/19.3.8.

I am using persistent storage. Of the clusters I tried upgrading, one is using Rook/Ceph and one is on AWS using EBS. Both exhibit exactly the same problem. However, if I turn off persistent storage, monitoring comes online fine.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 6
  • Comments: 37 (17 by maintainers)

Most upvoted comments

Reproduction steps

It’s really easy to reproduce in Rancher v2.4.3 with k8s v1.17.5 and monitoring v0.1.0. Just enable cluster monitoring with PV enabled.
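
For reference, the persistence-related answers when enabling cluster monitoring look roughly like the sketch below. The key names are assumed from the Rancher cluster-monitoring answer style and may differ between versions, so verify them against your installation:

    # Cluster monitoring answers (sketch; key names assumed)
    prometheus.persistence.enabled: "true"
    prometheus.persistence.size: "50Gi"
    prometheus.persistence.storageClass: "default"   # e.g. an EBS or Rook/Ceph storage class
    prometheus.retention: "12h"

Setting prometheus.persistence.enabled to "false" is effectively workaround Option 1 below.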

Reason

Tried Rancher monitoring 0.0.7, same problem.

Here is the change between v0.0.7 and v0.1.0 https://github.com/rancher/system-charts/pull/204/commits/32907f0b63f6c30f1574247696525d7c77d3c864

The change is small and looks good.

It looks like it’s related to Rancher version and Prometheus Operator version.

Because we introduced https://github.com/rancher/rancher/commit/ac7471754e6301262dfc2a696195d5bd099cbf8f in 2.4, it triggers a bug in Prometheus Operator that was fixed after 0.32.0: https://github.com/coreos/prometheus-operator/commit/8b8a2470726df00f1e8042ac5b16161a81261e1b
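
Concretely, the Prometheus custom resource ends up carrying a timestamp inside the volume claim template metadata, roughly as in the sketch below (values are illustrative). Operator 0.38 ignores this field, while 0.32 treats it as a spec change and recreates the StatefulSet:

    # Prometheus custom resource fragment (illustrative values)
    spec:
      storage:
        volumeClaimTemplate:
          metadata:
            creationTimestamp: "2020-05-12T10:15:30Z"   # rewritten on every forced redeploy
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 50Gi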

Workaround

Option 1:

Enable cluster monitoring with PV disabled

Option 2:

  • Disable cluster monitoring first.
  • Edit the system-library catalog to use https://github.com/loganhz/system-charts with the fix branch (see the sketch after this list).
  • Enable monitoring again.
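
In the local cluster this amounts to pointing the system-library catalog at the fork. A rough sketch of the catalog object after the change (field names follow the Rancher catalog CRD, but verify against your installation):

    # catalogs.management.cattle.io "system-library" (sketch)
    apiVersion: management.cattle.io/v3
    kind: Catalog
    metadata:
      name: system-library
    spec:
      url: https://github.com/loganhz/system-charts
      branch: fix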

Related fix commit https://github.com/loganhz/system-charts/commit/a41ce9f9e65778efc5cb05669486173449b1ee6c

Option 3:

  • Click View/Edit YAML on prometheus-operator-monitoring-operator
  • Delete the line - --crd-apigroup=monitoring.coreos.com
  • Update the image to quay.io/coreos/prometheus-operator:v0.38.1
  • Click Save (the operator container should then look roughly like the sketch below)
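
A sketch of the relevant part of the operator deployment after the edit (only the lines mentioned above change; all other arguments stay as they were):

    # prometheus-operator-monitoring-operator deployment, container spec (sketch)
    containers:
      - name: prometheus-operator                         # container name may differ
        image: quay.io/coreos/prometheus-operator:v0.38.1  # updated image tag
        args:
          # the "- --crd-apigroup=monitoring.coreos.com" line is deleted here
          # ...all other args left unchanged...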

Solutions

Option 1:

Upgrade Prometheus Operator version

At a minimum, we need to upgrade the Prometheus Operator version.

Maybe we can also bump the other component versions for Rancher monitoring, such as prometheus, operator, exporter, configmap-reload, and so on.

You can find the whole image list via https://github.com/rancher/system-charts/blob/release-v2.4/charts/rancher-monitoring/v0.1.0/values.yaml
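
As an illustration of what such a bump could look like in values.yaml (the keys and repositories below are a sketch, not the chart's exact layout; check the linked file for the real structure):

    # charts/rancher-monitoring/v0.1.0/values.yaml (illustrative keys)
    operator:
      image:
        repository: coreos/prometheus-operator   # the Rancher-mirrored repository name may differ
        tag: v0.38.1                              # bumped past the version containing the fix
    # similar bumps for prometheus, exporters, configmap-reload, etc.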

We need to verify monitoring and alerting in many cases, such as different K8s versions, PV enabled or not, and upgrading from an older monitoring version.

Option 2:

Revert https://github.com/rancher/rancher/commit/ac7471754e6301262dfc2a696195d5bd099cbf8f or find a better fix for it.

There is no retry logic in our catalog app controller, which means that in some cases monitoring might not be installed successfully. So we introduced the fix to force our catalog app controller to re-install the cluster monitoring app: https://github.com/rancher/rancher/issues/26440.

However, this fix causes the current issue. When the Rancher controller force-deploys the monitoring app, it updates the Prometheus custom resource with a new storage.volumeClaimTemplate.metadata.creationTimestamp value. Prometheus Operator 0.38 ignores the field, but Prometheus Operator 0.32 removes and redeploys the Prometheus StatefulSet. The Rancher controller then notices the Prometheus StatefulSet is gone and force-updates cluster monitoring again, so the Prometheus Operator removes and redeploys it again, and this repeats indefinitely. The force re-deploy logic should be refined; it is too aggressive now.

Option 3:

Maybe we can hard-code the creationTimestamp value in https://github.com/rancher/system-charts/blob/dev-v2.5/charts/rancher-monitoring/v0.1.0/charts/prometheus/templates/prometheus.yaml#L154 so it never changes. It is easy, but not ideal, as the template would then carry a fixed creationTimestamp.
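
That would mean pinning the field in the chart template to a constant, roughly like this (a sketch of the idea, not the actual patch):

    # prometheus.yaml template, around the linked line (sketch)
    storage:
      volumeClaimTemplate:
        metadata:
          creationTimestamp: null   # pinned constant, so forced redeploys no longer change the CR spec
        spec:
          # ...existing persistence spec unchanged...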

Thanks @jiaqiluo. I am trying to enable monitoring without a PV as a workaround, since I don’t think I really need persistence. But I am hitting another issue with monitoring now, because my cloud provider’s controller manager unsets the internal IPs of the nodes, causing various issues with metrics etc.

Hi @vitobotta

The new monitoring chart 0.1.1 is available for rancher:v2.4.5.

@jiaqiluo I’m on rancher 2.4.4 but I cannot see the monitoring version 0.1.1. Is there a way to force it?